Methods and apparatus for virtualized hardware optimizations for user space networking

ABSTRACT

Methods and apparatus for efficient data transfer within a user space network stack. Unlike prior art monolithic networking stacks, the exemplary networking stack architecture described hereinafter includes various components that span multiple domains (both in-kernel, and non-kernel). For example, unlike traditional “socket” based communication, disclosed embodiments can transfer data directly between the kernel and user space domains. Direct transfer reduces the per-byte and per-packet costs relative to socket based communication. A user space networking stack is disclosed that enables extensible, cross-platform-capable, user space control of the networking protocol stack functionality. The user space networking stack facilitates tighter integration between the protocol layers (including TLS) and the application or daemon. Exemplary systems can support multiple networking protocol stack instances (including an in-kernel traditional network stack).

PRIORITY

This application claims the benefit of priority to U.S. Provisional Patent Application Ser. No. 62/649,509 filed Mar. 28, 2018 and entitled “METHODS AND APPARATUS FOR EFFICIENT DATA TRANSFER WITHIN USER SPACE NETWORKING STACK INFRASTRUCTURES”, which is incorporated herein by reference in its entirety.

RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 16/144,992 filed Sep. 27, 2018 and entitled “Methods and Apparatus for Single Entity Buffer Pool Management”, U.S. patent application Ser. No. 16/146,533 filed Sep. 28, 2018 and entitled “Methods and Apparatus for Regulating Networking Traffic in Bursty System Conditions”, U.S. patent application Ser. No. 16/146,324 filed Sep. 28, 2018 and entitled “Methods and Apparatus for Preventing Packet Spoofing with User Space Communication Stacks”, U.S. patent application Ser. No. 16/146,916 filed Sep. 28, 2018 and entitled “Methods and Apparatus for Channel Defunct Within User Space Stack Architectures”, U.S. patent application Ser. No. 16/236,032 filed Dec. 28, 2018 and entitled “Methods and Apparatus for Classification of Flow Metadata with User Space Communication Stacks”, U.S. patent application Ser. No. 16/363,495 filed Mar. 25, 2019 and entitled “Methods and Apparatus for Dynamic Packet Pool Configuration in Networking Stack Infrastructures”, and U.S. patent application Ser. No. ______, filed herewith on Mar. 26, 2019 and entitled “Methods and Apparatus for Sharing and Arbitration of Host Stack Information with User Space Communication Stacks”, each of the foregoing being incorporated herein by reference in its entirety.

COPYRIGHT

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

1. Technical Field

The disclosure relates generally to the field of electronic devices, as well as networks thereof. More particularly, the disclosure is directed to methods and apparatus for implementing computerized networking stack infrastructures. Various aspects of the present disclosure are directed to, in one exemplary aspect, data transfer within user space networking stack infrastructures.

2. Description of Related Technology

The consumer electronics industry has seen explosive growth in network connectivity; for example, Internet connectivity is now virtually ubiquitous across many different device types for a variety of different applications and functionalities. The successful implementation of network connectivity over a myriad of different usage cases has been enabled by, inter alia, the principles of modular design and abstraction. Specifically, the traditional network communication paradigm incorporates multiple (generally) modular software “layers” into a “communication stack.” Each layer of the communication stack separately manages its own implementation specific considerations, and provides an “abstracted” communication interface to the next layer. In this manner, different applications can communicate freely across different devices without considering the underlying network transport.

The traditional network communication paradigm has been relatively stable for over 30 years. The Assignee hereof has developed its own implementation of a computer networking stack (based on the traditional networking paradigm) that is mature, robust, and feature-rich (yet conservative). This networking stack is the foundation for virtually all networking capabilities, including those used across the Assignee's products (e.g., MacBook®, iMac®, iPad®, and iPhone®, etc.) and has been designed to handle a variety of protocols (such as TCP (Transmission Control Protocol), UDP (User Datagram Protocol) and IP (Internet Protocol)), and proprietary extensions and functionalities.

While the traditional network communication paradigm has many benefits, changes in the commercial landscape have stretched the capabilities of the existing implementations. Over the past years new use cases have emerged that require capabilities beyond those of the traditional networking stack design. For example, some use cases require control and data movement operations to be performed in so-called “user space” (software that is executed outside the kernel, and specific to a user process). Common examples of such applications include without limitation e.g. Virtual Private Networks (VPN), application proxy, content and traffic filtering, and any number of other network-aware user applications.

Furthermore, certain types of user applications (e.g., media playback, real-time or interactive network applications) would benefit from workload-specific customizations and performance optimizations of the networking stack.

Unfortunately, the current one-size-fits-all networking stack was not designed for (and is thus ill-suited to) the requirements of the aforementioned use cases (and others contemplated herein). More directly, supporting user space applications and associated components from within the traditional in-kernel networking stack architecture adds complexity, increases technical debt (the implied cost of rework attributed to deploying a faster, but suboptimal, implementation), brings higher processing costs, and results in suboptimal performance and higher power consumption.

To these ends, a networking stack architecture and technology that caters to emerging non-kernel use cases is needed. Ideally, but not as a requisite, such solutions should preserve backwards compatibility with the traditional in-kernel networking stack. More generally, improved methods and apparatus for manipulating and/or controlling lower layer networking communication protocols by higher layer software applications are desired.

SUMMARY

The present disclosure satisfies the foregoing needs by providing, inter alia, methods and apparatus for data transfer within user space networking stack infrastructures.

In a first aspect of the disclosure, methods and apparatus to address contiguous or adjacent memory objects (which are prone to inadvertent memory corruption due to buffer overrun issues) are disclosed, as are methods and apparatus for a user to detect such issues. In one embodiment, data descriptors (also known as packets or quanta) have a metadata preamble placed at the beginning of the object. This metadata preamble is used to detect any inadvertent overwrite of the metadata. In one variant, each metadata object has a unique red zone pattern which is the XOR of a red zone cookie and the offset of the metadata object in the object's memory region. Red zone cookies are initialized with random numbers at OS boot. In the event the kernel detects a corruption, the user space process associated with the channel is terminated to prevent further damage.
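
By way of illustration, the following C sketch shows how such a per-object red zone pattern might be computed and verified. All structure and function names here are hypothetical, and the cookie setup is a minimal stand-in for the boot-time randomization described above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical metadata preamble; only the red zone field is shown. */
struct metadata_preamble {
    uint64_t md_redzone;    /* guard pattern for this object */
    /* ... descriptor fields would follow ... */
};

/* Cookie initialized with a random value once, at OS boot (sketch). */
static uint64_t redzone_cookie;

static void
redzone_init(void)
{
    redzone_cookie = ((uint64_t)arc4random() << 32) | arc4random();
}

/* The pattern is the XOR of the boot-time cookie and the object's
 * offset within its memory region, so no two objects share a value. */
static uint64_t
redzone_pattern(uint64_t region_offset)
{
    return (redzone_cookie ^ region_offset);
}

/* Nonzero means the preamble was overwritten; the kernel would then
 * terminate the user space process associated with the channel. */
static int
redzone_corrupted(const struct metadata_preamble *md, uint64_t region_offset)
{
    return (md->md_redzone != redzone_pattern(region_offset));
}
```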

In a second aspect, methods and apparatus for enhanced security are disclosed. In one embodiment, the architecture maintains a mirrored copy of the packet descriptor memory which is accessible only from the kernel. During packet handoff from user space to the kernel, the user accessible descriptor is validated (against the kernel copy) for any semantic issues, and the sanitized data is copied to the kernel mapped descriptor.

In another aspect, methods and apparatus for access control on user space network architecture ports to prevent unauthorized clients from opening channels are disclosed. In one embodiment, an access control mechanism is provided based on one or more attributes associated with a channel client, namely process ID, process executable's UUID, or key blob. A process (e.g., Nexus provider) chooses to select one or a combination of those attributes for securing access to a port of a named instance.

In another aspect, methods and apparatus for RST flood detection and mitigation are disclosed. In one embodiment, mechanisms to prevent SYN flood and RST flood attacks originating from the user space stack are provided. A USNSI flow-switch implements flow tracking logic which can detect SYN floods and RST floods to prevent these attacks originating from a device. If an attack is detected, the flow-switch will rate-limit the SYN and RST packets coming from the user space stack.
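
The disclosure does not specify the rate-limiting algorithm; the sketch below uses a simple token bucket, with made-up thresholds and names, purely to illustrate how a flow-switch might throttle outbound SYN/RST packets once a flood is suspected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative token-bucket limiter; values are invented examples. */
struct tcp_rate_limiter {
    uint32_t tokens;        /* packets still admissible this window */
    uint32_t max_per_sec;   /* refill rate, e.g. 32 SYN/RST per second */
    uint64_t window_start;  /* start of the current 1-second window */
};

/* Called for each outbound SYN or RST from a user space stack;
 * returns false when the packet should be dropped because an
 * apparent flood is in progress. */
static bool
rate_limit_admit(struct tcp_rate_limiter *rl, uint64_t now_sec)
{
    if (now_sec != rl->window_start) {      /* new window: refill */
        rl->window_start = now_sec;
        rl->tokens = rl->max_per_sec;
    }
    if (rl->tokens == 0)
        return false;                       /* rate-limited: drop */
    rl->tokens--;
    return true;
}
```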

In another aspect, methods and apparatus for split TX and RX packet pools (direction-specific DMA access for security) are disclosed. In one embodiment, buggy or hostile devices are prevented from using PCIe-mapped buffers to attack the host, such as by overwriting the content of in-use buffers, or performing timing/time-of-use based attacks. A USNSI setup maps segments to use the minimum possible memory access permissions on receive and transmit packet buffers.

In another aspect, methods and apparatus for use of randomized memory segment sizes are disclosed. As noted, buggy or hostile devices may use PCIe-mapped buffers to attack the host; to help mitigate this vulnerability, the system will randomize the PCIe address space mappings, to make it difficult for an attacker to find vulnerable host-side resources. To help support this security protection, a USNSI may in one variant randomize its segment size by randomizing the number of pages per segment at the time segments are allocated. In another variant, a USNSI may also randomize packet order within a segment, to make it more difficult to correlate packet address to position within a segment.
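
As a minimal sketch of the first variant, the allocator below picks a random page count per segment at allocation time; the bounds and function names are illustrative assumptions, not taken from the source.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE       4096u
#define SEG_MIN_PAGES   1u
#define SEG_MAX_PAGES   8u   /* illustrative bound, not from the source */

/* Picks a random page count in [SEG_MIN_PAGES, SEG_MAX_PAGES] at the
 * time a segment is allocated, so an attacker probing PCIe-visible
 * memory cannot predict where one segment ends and the next begins. */
static uint32_t
segment_alloc_size(void)
{
    uint32_t span = SEG_MAX_PAGES - SEG_MIN_PAGES + 1;
    uint32_t pages = SEG_MIN_PAGES + arc4random_uniform(span);
    return (pages * PAGE_SIZE);
}
```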

In another aspect, methods and apparatus for Device TOCTOU attack mitigation are disclosed. In one embodiment, the process (e.g., Nexus) makes a kernel-only copy before accessing device supplied data; all subsequent “sanity” checks and uses of the data are carried out on the kernel-only copy. Even if a compromised device launches a TOCTOU attack, the kernel uses the consistent kernel-only copy, which is not affected.

In another aspect, methods and apparatus for managing entitlements to access statistics and Nexus operations are disclosed. In one embodiment, entitlement checks ensure that privileged operations are conducted only by processes possessing such entitlements, e.g. trusted processes.

In another aspect, methods and apparatus for leveraging RTT estimation data for bounds checking are disclosed. RTT measurement is a critical value for TCP operations such as retransmission and fast recovery; hence, in one embodiment, the TCP stack(s) are in user space, and the kernel also performs its own rough RTT estimation using the flow tracker in the flow-switch. To accept measurements from user space, the kernel conducts a “sanity” check against its estimated upper and lower bounds. Only the RTT samples that pass the kernel sanity check can be published to other TCP stack instances.
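
A minimal sketch of such a bounds check follows; the particular bounds (half to four times the kernel's own estimate) are invented here for illustration and are not from the source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Kernel-side rough RTT estimate kept by the flow tracker (sketch). */
struct flow_rtt {
    uint32_t kernel_rtt_ms;   /* flow tracker's own passive estimate */
};

/* Accept a user space RTT sample only if it falls within bounds
 * derived from the kernel's own estimate; only accepted samples may
 * be published to other TCP stack instances. */
static bool
rtt_sample_sane(const struct flow_rtt *fr, uint32_t user_rtt_ms)
{
    uint32_t lower = fr->kernel_rtt_ms / 2;     /* illustrative bound */
    uint32_t upper = fr->kernel_rtt_ms * 4;     /* illustrative bound */
    return (user_rtt_ms >= lower && user_rtt_ms <= upper);
}
```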

In another aspect, methods and apparatus for malicious statistics detection before folding into trusted statistics are disclosed. In one embodiment, a process (e.g., the Nexus) also instantiates a shadow kernel-only statistics object in addition to the user space protocol stack instance's shared statistics object. The kernel-only statistics object stores historical values of the user space protocol stack statistics. Before accepting the user space protocol stack statistics, the Nexus derives a delta of each uTCP statistics snapshot against the historical value and conducts an anomaly detection. Also, for critical statistics such as cellular data usage, the USNSI in one variant relies only on trusted flow-switch kernel observed statistics.

In another aspect, methods and apparatus for preventing IP address/port spoofing are disclosed. In one embodiment, the TCP/IP stack(s) is/are in user space, and a flow-switch is used to perform a flow 5-tuple lookup in the kernel against the registered flows before packets are transmitted; e.g., to make sure the sender has the 5-tuple registration. Any packets with a non-matching 5-tuple and various other metadata such as flow ID would be dropped or otherwise handled.
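
The sketch below illustrates this registration check with a linear scan standing in for a real flow table lookup; the structures and names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative 5-tuple key; a real flow-switch would hash this. */
struct flow_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

struct flow_entry {
    struct flow_tuple tuple;
    bool              registered;
};

static bool
tuple_equal(const struct flow_tuple *a, const struct flow_tuple *b)
{
    return (a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
        a->src_port == b->src_port && a->dst_port == b->dst_port &&
        a->protocol == b->protocol);
}

/* Outbound packets whose 5-tuple was never registered by the sending
 * process are dropped; a linear scan stands in for the flow table. */
static bool
flow_tuple_registered(const struct flow_entry *table, int n,
    const struct flow_tuple *pkt)
{
    for (int i = 0; i < n; i++) {
        if (table[i].registered && tuple_equal(&table[i].tuple, pkt))
            return true;
    }
    return false;   /* unregistered tuple: drop or otherwise handle */
}
```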

In another aspect, methods and apparatus for trusted TFO & ECN are disclosed. In one embodiment, a TCP stack that supports both TCP Fast Open (TFO) and Explicit Congestion Notification (ECN) is used; both TCP options are enabled/disabled based on per-network heuristics maintained on the system, so as to avoid using TFO and ECN on networks that either do not support these options, or blacklist devices if the options are present in the TCP header.

In one variant, the ECN and TFO heuristics are updated each time a TCP connection experiences a success or failure when using TFO or ECN. The USNSI TCP protocol stack runs in the user process's context, and all processes can indicate to the system heuristics a failure of TFO or ECN; however, only processes that are trusted on the system can update the heuristics with TFO or ECN success. This prevents malicious apps from incorrectly updating TFO or ECN success on networks that do not support these options.

In another aspect, methods and apparatus for a driver managed pool are disclosed. In one embodiment, a system global packet buffer pool is obviated in favor of a packet buffer pool managed and owned by a driver, which can be dedicated to that driver or shared among several drivers. The owner of the pool handles notifications to dynamically map and unmap the pool's memory from its device IOMMU aperture. This same notification can also “wire/un-wire” the memory as needed. Read and write attributes can also be restricted on both the host and the device side based on the I/O transfer direction for added security.

In another aspect, methods and apparatus for multi-buflet descriptors (arrays) are disclosed. In one embodiment, jumbo frames are supported in a memory efficient manner; rather than always allocating enough memory to hold the largest possible frame size, a packet can instead hold an array of buflets, where each buflet points to a fixed size block of memory allocated from a pool. The binding between the buflets and a packet can be formed on demand. This scheme allows, inter alia, a packet to have a variable number of buflets depending on the size of the payload. This also makes it easier to support scatter-gather style DMA engines by handing them buflets, which are uniform by nature.
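
The following sketch illustrates one possible shape of such a descriptor; the buflet size, array bound, and field names are assumptions for illustration only.

```c
#include <stdint.h>

#define BUFLET_SIZE  2048u   /* illustrative fixed block size */
#define MAX_BUFLETS  8u      /* enough for a jumbo frame at this size */

/* A buflet points at one fixed-size block from the buffer pool. */
struct buflet {
    void     *buf_addr;   /* start of the block */
    uint32_t  buf_len;    /* bytes used in the block */
};

/* A packet binds only as many buflets as the payload requires,
 * rather than reserving the largest possible frame up front. */
struct packet {
    uint16_t      pkt_nbuflets;            /* buflets currently bound */
    struct buflet pkt_buflets[MAX_BUFLETS];
};

/* Number of buflets needed for a payload of a given size. */
static uint16_t
buflets_needed(uint32_t payload_len)
{
    return (uint16_t)((payload_len + BUFLET_SIZE - 1) / BUFLET_SIZE);
}
```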

In another aspect, methods and apparatus for segment-based IOMMU/DART mapping are disclosed. In one embodiment, use of a look-up of an I/O bus address is at least partly obviated in favor of using a memory segment, which is guaranteed to be a multiple of a page size, as the smallest memory unit for I/O mappings. Each memory segment is then divided into several packet buffers. Only one I/O bus address lookup is required for all the packet buffers within that segment, and this I/O bus address can also be cached within the segment object.
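
A minimal sketch of the cached-translation arithmetic follows, assuming a hypothetical segment object that caches the bus address of its base; real IOMMU/DART plumbing is omitted.

```c
#include <stdint.h>

/* One I/O mapping covers a whole multi-page segment; the bus address
 * is cached in the segment object so per-buffer lookups are avoided. */
struct segment {
    void     *seg_kva;      /* kernel virtual base of the segment */
    uint64_t  seg_busaddr;  /* cached I/O bus address of the base */
    uint32_t  seg_size;     /* multiple of the page size */
};

/* Translate a packet buffer inside the segment to its bus address
 * with simple arithmetic instead of another I/O address lookup. */
static uint64_t
buffer_busaddr(const struct segment *seg, const void *buf)
{
    uint64_t off = (uint64_t)((const char *)buf - (const char *)seg->seg_kva);
    return (seg->seg_busaddr + off);
}
```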

In another aspect, methods and apparatus for split metadata and buffer management are disclosed. In one embodiment, exposing packet metadata to hardware such as Wi-Fi chips and cellular basebands is obviated in favor of using different memory regions for the packet metadata and the packet buffers to prevent malicious hardware from accessing the packet metadata. In one variant, only the packet buffers are I/O mapped and visible to the device.

In another aspect, methods and apparatus for a user packet pool are disclosed. In one embodiment, an efficient scheme is provided that enables dynamic scale-up and scale-down of the memory available to each process according to the current throughput requirements. A user packet pool is used in one variant to achieve this; it attempts to reuse the efficient packet I/O mechanism to move memory buffers across the kernel-user boundary and utilizes channel synchronization statistics to dynamically scale the amount of memory available to each channel.

In another aspect, methods and apparatus for user pipe dynamic memory management using sync statistics are disclosed. In one embodiment, a user pipe process (e.g., Nexus) provides efficient inter-process communication (IPC) between user space processes using shared memory. Since the number of processes using IPC on an iOS device can be significant, an efficient mechanism is provided so as to keep the shared memory usage to a minimum without compromising on data throughput. In one variant, a fair estimate of the immediate memory usage of the user (the working set) is maintained based on the recent past usage; the user pipe Nexus maintains weighted moving average statistics of memory used during each synchronization, and can also keep adjusting the channel memory accordingly as needed.
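
A sketch of such a weighted moving average follows; the 7/8-to-1/8 weighting is an illustrative choice, not taken from the source.

```c
#include <stdint.h>

/* Exponentially weighted moving average of bytes used per sync. */
struct pipe_mem_stats {
    uint64_t ewma_bytes;   /* working-set estimate */
};

/* Called at each channel synchronization with the bytes actually
 * used; the estimate drives growing or shrinking the shared memory
 * backing the channel. */
static void
pipe_mem_update(struct pipe_mem_stats *st, uint64_t bytes_this_sync)
{
    st->ewma_bytes = (st->ewma_bytes * 7 + bytes_this_sync) / 8;
}
```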

In another aspect, methods and apparatus for purgeable memory (compressible and swappable) are disclosed. In one embodiment, the USNSI architecture allocates all memory as purgeable and wires memory on demand.

In another aspect, methods and apparatus for memory regions/arenas (purpose, layout, access protection, sharing model) are disclosed. In one embodiment, an efficient and generic mechanism to represent and manage shared memory objects of varying types and sizes which are memory mapped to user space and/or kernel space is disclosed. In one such embodiment, the USNSI architecture uses shared memory for efficient packet I/O, network statistics, and system attributes (sysctl). The USNSI arena is a generic and efficient mechanism to represent these various types of shared memory subsystems and their backing memory caches, regions, and access protection attributes. A channel schema is a representation of the shared memory layout that enables a user space process to efficiently access various channel objects.

In another aspect, methods and apparatus for mirrored memory regions are disclosed. In one embodiment, to implement security validation and sanitization of shared memory objects at the user-kernel boundary, a kernel-only copy of these objects is maintained, and an efficient method to allocate and retrieve these objects is provided. In one variant, mirrored memory object(s) is/are created, which share the same region offset as that of the associated object and hence can be retrieved quickly from the attributes of the associated object.

In another aspect, methods and apparatus for channel defunct (map overrides) are disclosed. In one embodiment, networking memory associated with a process is freed when the process is backgrounded; redirection of the shared memory mappings of the task so that they are backed with anonymous (zero-filled) pages is used to free the underlying memory. When the task is resumed, the user space shared memory accessor functions (e.g., libsyscall wrappers) have the logic to detect a defuncted state of the shared memory, and efficiently and effectively handle errors due to data inconsistencies.

In another aspect, methods and apparatus for conducting one or more “reaps” based on idleness are disclosed. In one embodiment, efficient and aggressive pruning and purging of idle resources are utilized via, inter alia, mechanisms which can detect idle resources and can offload pruning and purging of these resources in a deferred context.

In another aspect, methods and apparatus for management of daemon “jetsam” are disclosed. In one embodiment, a memory management module that keeps track of the memory consumed by the network protocols is provided; depending on memory usage, the module indicates to the system that active work is being performed by the protocols on behalf of the application. Once the buffers are returned to the memory management module, the module indicates to the system that the active work is complete. This prevents the system from targeting processes that consume more memory while doing active work.

In another aspect, methods and apparatus for TCP memory “defunct” management are disclosed. In one embodiment, data inconsistency issues when a channel is defuncted during processing of a TCP packet are avoided by use of a shadow copy of the original TCP header in heap memory. Once TCP processing begins, it uses the copy of the TCP header to make decisions, which prevents any inconsistency or data corruption. The validation is done prior to handing off the payload data to the layer above TCP, as well as within the TCP input processing paths.

In another aspect, methods and apparatus for flow classification are disclosed. In one embodiment, USNSI packets have a struct_flow as part of the packet metadata which contains most of the information that the various layers need, and it is carried into BSD/user space, etc. The contents of this structure are filled once by the flow-switch.

In another aspect, methods and apparatus for flow management are disclosed. In one embodiment, the flow lifecycle (e.g., flow creation and destruction), which interfaces with calls/events from other components, is managed. In one variant, a flow manager is the entity that provides such an interface. It accepts calls to create/destroy/defunct flows. It also shuts down flows when the flow owner process exits. This allows proper clean-ups to be done regardless of how the process terminates.

In another aspect, methods and apparatus for flow entry management are disclosed. In one embodiment, a mechanism to facilitate efficient packet forwarding within a USNSI flow-switch includes packet forwarding based on the entries of a flow table, which allows facilitation of optimal forwarding data plane logic, where e.g., multiple network interface Nexuses are fused together to form a direct conduit for sending packets to one another.

In another aspect, methods and apparatus for flow action management are disclosed. In one embodiment, a flow-switch flow carries actions on packets for a given flow; a mechanism is disclosed whereby possible actions that can be applied to a packet, e.g. forward to a flow-switch port for a user space protocol stack, forward to the BSD stack, drop, transform, etc., are defined. This approach allows for, inter alia, an efficient way to apply traffic rules without involving separate database lookups.

In another aspect, methods and apparatus for flow route management are disclosed. In one embodiment, a USNSI flow route comprises a cache of the relevant BSD routing information, such that packets for USNSI flows can find that information within the USNSI context along with the flow lookup. The flow route is notified when related events happen, e.g. route changes or ARP expiry, to maintain consistency. The flow routes allow packets going out of the system via USNSI channels to not incur per-packet routing table lookup costs.

In another aspect, methods and apparatus for flow tracking are disclosed. In one embodiment, a process (e.g., flow-switch) has a flow tracker that passively tracks flow state/statistics during flow classification. It provides a KPI for other components to query flow states and statistics. It also implements pro-active actions in cleaning up flows that are, e.g., deemed to be terminated (by both ends) and not expecting any more data.

In another aspect, methods and apparatus for achieving low latency for urgent packets using flow tracking are disclosed. In one embodiment, urgent packets such as DNS queries and TCP control packets are identified and processed (e.g., via a flush/notify) when detected, to ensure they are delivered with low latency. This allows for, inter alia, dynamic adjustment of the notifications posted to the user space process depending on the contents of the packets.

In another aspect, methods and apparatus for flow purging/“defunct-ing” are disclosed. In one embodiment, a flow tracker passively updates flow state; a process (e.g., flow-switch) actively scans through all flows, finds dead flows, and closes them; and orthogonally, during defunct, a process (e.g., assertion) calls in to defunct flows when a target process is suspended.

In another aspect, methods and apparatus for dynamic growing/shrinking of flow-switch ports are disclosed. In one embodiment, a process (e.g., flow-switch) breaks up the port space into small, contiguous chunks and manages them at that level of granularity; data structures are grown and shrunk on demand. This allows for, inter alia, sparse port usage.

In another aspect, methods and apparatus for sharing of a packet pool among trusted ports are disclosed. In one embodiment, one or more packet pools are configured to be shared across processes, e.g. between two processes, or between the kernel and trusted first-party apps. Thus, packet movement does not require copying, and this allows for zero-copy data transfers between any of the entities in instances where the configuration allows for such.

In another aspect, methods and apparatus for an efficient copy-checksum mechanism used in, e.g., a process/user space stack are disclosed. In one embodiment, the flow-switch presents itself to the user space protocol stack as a virtual network port, which provides functions similar to those of today's hardware network devices, e.g. checksum offloading, etc. For a flow-switch process, the copying is inherently necessary for security reasons when transitioning between trust domains (e.g. from user space to kernel). Thus, in one variant, a combined copy and checksum of the data is used, so the user space does not need to scan through the data and compute a checksum.
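
A minimal sketch of such a combined copy-and-checksum pass follows, using the 16-bit one's-complement Internet checksum; it assumes aligned, even-length data and is not the disclosed implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy payload from a user-pool buffer to a kernel/driver-pool buffer
 * while folding the 16-bit Internet checksum into the same pass, so
 * the data is only traversed once. Sketch: assumes 16-bit aligned
 * buffers and an even length; a full version would handle a trailing
 * odd byte and fold in any pseudo-header sum. */
static uint16_t
copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint16_t *s = src;
    uint16_t *d = dst;
    uint32_t sum = 0;

    for (size_t i = 0; i < len / 2; i++) {
        d[i] = s[i];            /* the security-mandated copy...     */
        sum += s[i];            /* ...and the checksum, in one pass  */
    }
    while (sum >> 16)           /* fold carries into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The design point is that the copy across the trust boundary already touches every byte; accumulating the checksum in the same loop makes the offload essentially free compared to a separate scan.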

In another aspect, methods and apparatus for IP fragment management are disclosed. In one embodiment, a lightweight packet reassembly for the channel (as if under perfect network conditions) is used, wherein a process (e.g., flow-switch) first accumulates all fragments as they come (e.g., using the IP address and IP ID, per the IP reassembly RFC), then performs a single flow lookup, and then delivers all fragments to user space. From the user space protocol stack's point of view, the flow-switch provides an in-sequence delivery network abstraction, which, inter alia, facilitates handling/receiving of fragments in a user space protocol stack.
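
The sketch below illustrates the accumulate-then-deliver idea; the completeness test is deliberately simplified (it ignores holes in the fragment sequence) and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_FRAGS  64   /* illustrative cap */

/* Fragments of one datagram, keyed by (src, dst, IP ID) as in the IP
 * reassembly RFC; the flow-switch batches them and does one flow
 * lookup before handing the whole set to user space. */
struct frag_queue {
    uint32_t src_ip, dst_ip;
    uint16_t ip_id;
    int      nfrags;
    void    *frags[MAX_FRAGS];
    bool     have_last;      /* saw the fragment with MF bit clear */
};

/* Returns true once the batch can be delivered to user space in one
 * in-sequence handoff; sketch only checks for the final fragment. */
static bool
frag_queue_add(struct frag_queue *fq, void *frag, bool more_fragments)
{
    if (fq->nfrags < MAX_FRAGS)
        fq->frags[fq->nfrags++] = frag;
    if (!more_fragments)
        fq->have_last = true;
    return fq->have_last;
}
```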

In another aspect, methods and apparatus for user space stack flow control are disclosed. In one embodiment, a user space TCP/IP stack architecture is used wherein the stack instance and the network driver are operating in different domains (user space & kernel space). An efficient mechanism is provided for the user space stack to determine the admissibility state of a given flow in the stack instance. In one variant, USNSI channels provide a flow advisory table in shared memory which is updated by the kernel and consulted by the user space stack to flow control a given flow. In essence, this table provides admission control information to the user space stack.

In another aspect, methods and apparatus for user space stack flow advisory are disclosed. In one embodiment, an efficient mechanism is provided to signal the user space stack from kernel space to “flow control” or “resume” a given flow in the stack instance. In one variant, USNSI channels utilize a kernel event mechanism with a specific type to notify the user space stack about any updates to the flow advisory state in the kernel, which is reflected in the “flow advisory table” maintained in shared memory. Each row in the table represents information about the flow, as well as the advisory state (e.g. flow-controlled, etc.).

In another aspect, methods and apparatus for user space stack schema are disclosed.

In one embodiment, a common AQM (Active Queue Management) functionality is provided for a network interface hosting multiple and differing stack instances (user space protocol stack and kernel protocol stack); a USNSI process is a common entry point for the in-kernel BSD stack and the user space stack. In one variant, a flow-switch nexus handles the different packet descriptor schemes and converts them to the packet descriptor scheme being used by the underlying network driver before enqueuing the packets to the AQM queues. It also implements the appropriate mechanisms to provide flow control and advisory feedback from the AQM queues to the different stack instances.

In another aspect, methods and apparatus for host stack coexistence and NetNS for port tuple arbitration are disclosed. In one embodiment, an efficient mechanism to share and arbitrate e.g., the 5-tuple network namespace (i.e. which port may be used on which source address, etc.) is provided. In one variant, a USNSI architecture implements a shared namespace manager (NetNS) to enable sharing and arbitration of the network namespace between the various stack instances.

In another aspect, methods and apparatus for host stack coexistence are disclosed. In one embodiment, a USNSI leverages existing functions in a BSD stack to handle certain types of packets. A process (e.g., flow-switch), when seeing those packets, forwards them to the BSD stack, and the USNSI then registers callbacks for events from those BSD stacks, as well as queries information for its flow management, etc.

In another aspect, methods and apparatus for system-wide sysctl via shared memory (RO) are disclosed. In one embodiment, a USNSI implements a system-wide sysctl shared memory region shared by all processes to minimize memory usage; in one variant, it is controllable by the user via the sysctl command to allow easy tuning, and is readable and controllable by the kernel network stack if needed.

In another aspect, methods and apparatus for leveraging shared memory for user space stack management and statistics are disclosed.

In another aspect, methods and apparatus for a trusted RTT estimation based on passive observation are disclosed. In one embodiment, a user space protocol stack is provided a feedback mechanism to tell the kernel about its packet processing state, e.g. the processing time of each packet as compared to the kernel protocol stack. A flow tracker passively and selectively timestamps TCP packets and computes the processing time of RX packets and the network latency of TX packets. This information is kept in the flow entry for bounds checking and as a scheduler hint, as well as for diagnostic purposes.

In another aspect, methods and apparatus for implementing packet hooks (e.g., NLC v2 (NetEm)) are disclosed. In one embodiment, a NetEm packet scheduler on RX/TX is used to simulate various networking conditions, to simulate hardware features, etc. This is done by leveraging a USNSI's built-in infrastructures, e.g. pre- and post-sync and notify operations on the rings/queues.

In another aspect, methods and apparatus for header compression and decompression are disclosed.

In another aspect, methods and apparatus for batching optimizations in e.g., a Bluetooth daemon are disclosed. In one embodiment, the methods and apparatus reduce the per-packet cost for Bluetooth communication via packet batching heuristics in a Bluetooth user space driver to efficiently move packet batches over USNSI channels, to/from agent processes, as well as to/from the kernel UART HW driver.

In another aspect, methods and apparatus for replacing socket-based IPC with channels are disclosed.

In another aspect, methods and apparatus for a mitigation thread dynamic threshold table are disclosed. In one embodiment, an interrupt mitigation scheme that helps reduce the interrupt processing load while preserving low latency and throughput is described.

In another aspect, methods and apparatus for using RX mitigation and RX ring size to normalize packet flow in bursty cellular conditions are disclosed. In one embodiment, the bursty packet load at the network interface is normalized by adjusting the mitigation logic thresholds and the input queue size to achieve uniform throughput in bursty scenarios.

In another aspect, methods and apparatus for closed loop scheduling using user space protocol stack packet processing time to optimize user stack latency are disclosed. In one embodiment, an RTT estimation technique built into the flow-switch is used to track the user stack processing time and form a closed loop along with the scheduler and CPU frequency adjuster. The closed loop controller takes as input the flow-switch local RTT estimation (user space network stack processing time), the CPU frequency, and the process scheduling properties; its outputs are the next CPU frequency and process priority.

In another aspect, methods and apparatus for a submission/completion queue driver are disclosed. In one embodiment, a common and flexible queueing model is provided in the device driver abstraction layer for packet I/O. The queues hide the underlying USNSI rings, and also reduce the locking contention between the driver work loop and the USNSI threads.

In another aspect, methods and apparatus for receive submission/completion queues that work with buffers instead of packets are disclosed.

In another aspect, methods and apparatus for driver doorbell management and refill are disclosed. In one embodiment, a doorbell notifies the driver layer when one or more packets are available; a USNSI Family queries the driver for the amount of free space available, in either packets or bytes. A refill operation is then requested with this free space information, which will dequeue a bounded number of packets from the AQM queue and pass them along to the driver's ring/queue for immediate consumption.

In another aspect, methods and apparatus for queue-level reporting for network scheduling are disclosed.

In another aspect, methods and apparatus for providing possible data transmission opportunities (enabling efficient resource allocation) are disclosed.

In another aspect, methods and apparatus for kernel bypass, including a transparent security (IPsec) gateway, are disclosed. In one embodiment, a USNSI will allow most IPsec components to be in user space. Installing new components will only require restarting the user space IPsec forwarding daemon. In addition, the user space transformation plane allows for significantly better performance due to the elimination of costs associated with in-kernel design and implementation.

In another aspect, methods and apparatus for bridging, forwarding and routing are disclosed.

In another aspect, methods and apparatus for “tapping” on any channel (e.g., libpcap/tcpdump) are disclosed.

In another aspect, methods and apparatus for a test user space TCP stack are disclosed.

In another aspect, methods and apparatus for Nexus statistics (e.g., flow-switch statistics)/Channel statistics (Ring statistics/Sync statistics)/Flow statistics are disclosed.

In another aspect, methods and apparatus for a scheduling hint added to TCP RTT are disclosed.

In another aspect, a computerized device implementing one or more of the foregoing aspects is disclosed and described. In one embodiment, the device comprises a personal or laptop computer. In another embodiment, the device comprises a mobile device (e.g., tablet or smartphone).

In another aspect, an integrated circuit (IC) device implementing one or more of the foregoing aspects is disclosed and described. In one embodiment, the IC device is embodied as a SoC (system on chip) device. In another embodiment, an ASIC (application specific IC) is used as the basis of the device. In yet another embodiment, a chip set (i.e., multiple ICs used in coordinated fashion) is disclosed.

In another aspect, a computer readable storage apparatus implementing one or more of the foregoing aspects is disclosed and described. In one embodiment, the computer readable apparatus comprises a program memory, or an EEPROM. In another embodiment, the apparatus includes a solid state drive (SSD) or other mass storage device. In another embodiment, the apparatus comprises a USB or other “flash drive” or other such portable removable storage device. In yet another embodiment, the apparatus comprises a “cloud” (network) based storage device which is remote from yet accessible via a computerized user or client electronic device.

In yet another aspect, a software architecture is disclosed. In one embodiment, the architecture includes both user space and kernel space, separated via a software or virtual partition.

In one aspect, a method for copy and checksum optimizations with user space communication stacks is disclosed. In one exemplary embodiment, the method includes: configuring a first user space application with pass-through checksum functionality; reading data from a first pool of resources associated with the first user space application; calculating a checksum value based on the data; and storing the data in a second pool of resources associated with a hardware driver.

In one variant, reading the data comprises reading a plurality of word segments. In one such variant, calculating a checksum value based on the data comprises a running summation of the plurality of word segments. Additionally, in this variant, storing the data in the second pool of resources may comprise storing the checksum value. In some implementations, reading the data from the first pool of resources is performed by a kernel space process. In some implementations, calculating the checksum value is performed by a kernel space process.

In one variant, the hardware driver is configured for a network interface card. In one such variant, the hardware driver does not provide checksum functionality.

In one aspect, a system configured for copy and checksum optimizations with user space communication stacks is disclosed. In one embodiment, the system includes: an application that comprises a user space communication stack; a first pool of dedicated memory resources for the application; a second pool of dedicated memory resources for a kernel space hardware driver; a kernel space flow-switch configured to copy-checksum data from the first pool of dedicated memory resources to the second pool of dedicated memory resources; and kernel space logic. In one exemplary embodiment, the kernel space logic is configured to: read data from the first pool of dedicated memory resources; calculate a checksum value based on the data; and store the data in the second pool of dedicated memory resources.

In one variant, the kernel space hardware driver comprises a network interface card. In one such variant, the network interface card is configured to transmit IP data. In one such variant, the network interface card does not include a checksum functionality. In one such variant, the network interface card operates without a checksum functionality. Additionally, the user space communication stack may operate without the checksum functionality. In fact, the user space communication stack may operate in a pass-through mode.

In one variant, the kernel space logic is configured to read data from the first pool of dedicated memory resources in word segments. In one such variant, the kernel space logic is configured to calculate the checksum from the word segments.

In another variant, the kernel space logic is prioritized over user space logic.

In one aspect, a non-transitory computer readable apparatus comprising a storage medium having one or more computer programs stored thereon is disclosed. In one exemplary embodiment, the one or more computer programs, when executed by a processing apparatus, are configured to: read one word of data from a first pool of memory; calculate a checksum value based on the one word of data; and store the one word of data in a second pool of memory.

In one variant, the first pool of memory is dedicated to a first application comprising a hardware driver, the hardware driver receiving data for a user space networking stack.

Other features and advantages of the present disclosure will immediately be recognized by persons of ordinary skill in the art with reference to the attached drawings and detailed description of exemplary embodiments as given below.

All figures © Copyright 2017-2019 Apple Inc. All rights reserved.

DETAILED DESCRIPTION

Reference is now made to the drawings, wherein like numerals refer to like parts throughout.

Detailed Description of Exemplary Embodiments

Exemplary embodiments of the present disclosure are now described in detail. While embodiments are primarily discussed in the context of use in conjunction with an inter-processor communication (IPC) link such as that described in, for example, commonly owned U.S. patent application Ser. No. 14/879,024 filed Oct. 8, 2015 and entitled “METHODS AND APPARATUS FOR RUNNING AND BOOTING AN INTER-PROCESSOR COMMUNICATION LINK BETWEEN INDEPENDENTLY OPERABLE PROCESSORS”, now U.S. Pat. No. 10,078,361, and co-owned and co-pending U.S. patent application Ser. No. 16/112,480 filed Aug. 24, 2018 and entitled “METHODS AND APPARATUS FOR CONTROL OF A JOINTLY SHARED MEMORY-MAPPED REGION”, each of which being incorporated herein by reference in its entirety, it will be recognized by those of ordinary skill that the present disclosure is not so limited.

Existing Network Socket Technologies—

FIG. 1 illustrates one logical representation of a traditional network socket 102, useful for explaining various aspects of the traditional networking interface. A network “socket” is a virtualized internal network endpoint for sending or receiving data at a single node in a computer network. A network socket may be created (“opened”) or destroyed (“closed”) and the manifest of network sockets may be stored as entries in a network resource table which may additionally include reference to various communication protocols (e.g., Transmission Control Protocol (TCP) 104, User Datagram Protocol (UDP) 106, Inter-Processor Communication (IPC) 108, etc.), destination, status, and any other operational processes (kernel extensions 112) and/or parameters; more generally, network sockets are a form of system resource.

As shown in FIG. 1, the socket 102 provides an application programming interface (API) that spans between the user space and the kernel space. An API is a set of clearly defined methods of communication between various software components. An API specification commonly includes, without limitation: routines, data structures, object classes, variables, remote calls and/or any number of other software constructs commonly defined within the computing arts.

As a brief aside, user space is a portion of system memory that a processor executes user processes from. User space is relatively freely and dynamically allocated for application software and a few device drivers. The kernel space is a portion of memory that a processor executes the kernel from. Kernel space is strictly reserved (usually during the processor boot sequence) for running privileged operating system (O/S) processes, extensions, and most device drivers. For example, each user space process normally runs in a specific memory space (its own “sandbox”), and cannot access the memory of other processes unless explicitly allowed. In contrast, the kernel is the core of a computer's operating system; the kernel can exert complete control over all other processes in the system.

The term “operating system” may refer to software that controls and manages access to hardware. An O/S commonly supports processing functions such as e.g., task scheduling, application execution, input and output management, memory management, security, and peripheral access. As used herein, the term “application” refers to software that can interact with the hardware only via procedures and interfaces offered by the O/S.

The term “privilege” may refer to any access restriction or permission which restricts or permits processor execution. System privileges are commonly used within the computing arts to, inter alia, mitigate the potential damage of a computer security vulnerability. For instance, a properly privileged computer system will prevent malicious software applications from affecting data and task execution associated with other applications and the kernel.

As used herein, the term “in-kernel” and/or “kernel space” may refer to data and/or processes that are stored in, and/or have privilege to access, the kernel space memory allocations. In contrast, the terms “non-kernel” and/or “user space” refer to data and/or processes that are not privileged to access the kernel space memory allocations. In particular, user space represents the address space specific to the user process, whereas non-kernel space represents address space which is not in-kernel, but which may or may not be specific to user processes.

As previously noted, the illustrated socket 102 provides access to Transmission Control Protocol (TCP) 104, User Datagram Protocol (UDP) 106, and Inter-Processor Communication (IPC) 108. TCP, UDP, and IPC are various suites of transmission protocols each offering different capabilities and/or functionalities. For example, UDP is a minimal message-oriented encapsulation protocol that provides no guarantees to the upper layer protocol for message delivery and the UDP layer retains no state of UDP messages once sent. UDP is commonly used for real-time, interactive applications (e.g., video chat, voice over IP (VoIP)) where loss of packets is acceptable. In contrast, TCP provides reliable, ordered, and error-checked delivery of data via a retransmission and acknowledgement scheme; TCP is generally used for file transfers where packet loss is unacceptable, and transmission latency is flexible.

As used herein, the term “encapsulation protocol” may refer to modular communication protocols in which logically separate functions in the network are abstracted from their underlying structures by inclusion or information hiding within higher level objects. For example, in one exemplary embodiment, UDP provides extra information (port numbering).

As used herein, the term “transport protocol” may refer to communication protocols that transport data between logical endpoints. A transport protocol may include encapsulation protocol functionality.

Both TCP and UDP are commonly layered over an Internet Protocol (IP) 110 for transmission. IP is a connectionless protocol for use on packet-switched networks that provides a “best effort delivery”. Best effort delivery does not guarantee delivery, nor does it assure proper sequencing or avoidance of duplicate delivery. Generally these aspects are addressed by TCP or another transport protocol based on UDP.

As a brief aside, consider a web browser that opens a webpage; the web browser application would generally open a number of network sockets to download and/or interact with the various digital assets of the webpage (e.g., for a relatively commonplace webpage, this could entail instantiating ~300 sockets). The web browser can write (or read) data to the socket; thereafter, the socket object executes system calls within kernel space to copy (or fetch) data to data structures in the kernel space.

As used herein, the term “domain” may refer to a self-contained memory allocation e.g., user space, kernel space. A “domain crossing” may refer to a transaction, event, or process that “crosses” from one domain to another domain. For example, writing to a network socket from the user space to the kernel space constitutes a domain crossing access.

In the context of a Berkeley Software Distribution (BSD) based networking implementation, data that is transacted within the kernel space is stored in memory buffers that are also commonly referred to as “mbufs”. Each mbuf is a fixed size memory buffer that is used generically for transfers (mbufs are used regardless of the calling process e.g., TCP, UDP, etc.). Arbitrarily sized data can be split into multiple mbufs and retrieved one at a time or (depending on system support) retrieved using “scatter-gather” direct memory access (DMA) (“scatter-gather” refers to the process of gathering data from, or scattering data into, a given set of buffers). Each mbuf transfer is parameterized by a single identified mbuf.
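
For illustration only, a drastically simplified mbuf chain might look like the following; the real BSD struct mbuf carries considerably more bookkeeping, and the fixed size here is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define MLEN  256   /* illustrative fixed mbuf data size */

/* Simplified BSD-style mbuf: a fixed-size buffer with a link to the
 * next mbuf, so arbitrarily sized data is split across a chain. */
struct mbuf {
    struct mbuf *m_next;        /* next mbuf holding this data */
    uint32_t     m_len;         /* valid bytes in this mbuf */
    uint8_t      m_data[MLEN];  /* fixed-size payload storage */
};

/* Total payload length requires walking the whole chain; each socket
 * transfer may touch (and verify) many such mbufs one at a time. */
static size_t
mbuf_chain_len(const struct mbuf *m)
{
    size_t len = 0;
    for (; m != NULL; m = m->m_next)
        len += m->m_len;
    return len;
}
```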

Notably, each socket transfer can create multiple mbuf transfers, where each mbuf transfer copies (or fetches) data from a single mbuf at a time. As a further complication, because the socket spans both: (i) user space (limited privileges) and (ii) kernel space (privileged without limitation), the socket transfer verifies that each mbuf copy into/out of kernel space is valid. More directly, the verification process ensures that the data access is not malicious, corrupted, and/or malformed (i.e., that the transfer is appropriately sized and is to/from an appropriate area).

The processing overhead associated with domain crossing is a non-trivial processing cost. Processing cost affects user experience both directly and indirectly. A processor has a fixed amount of processing cycles every second; thus cycles that are used for transfer verification detract from more user perceptible tasks (e.g., rendering a video or audio stream). Additionally, processor activity consumes power; thus, increases in processing overhead increase power consumption.

Referring back to FIG. 1, in addition to the generic TCP 104, UDP 106, and IPC 108 communication suites, the illustrated socket 102 also may provide access to various kernel extensions 112. A kernel extension is a dynamically loaded bundle of executable code that executes from kernel space. Kernel extensions may be used to perform low-level tasks that cannot be performed in user space. These low-level tasks typically fall into one or more of: low-level device drivers, network filters, and/or file systems. Examples of sockets and/or extensions include without limitation: route (IP route handling), ndrv (packet 802.1X handling), key (key management), unix (translations for Unix systems), kernel control, kernel events, parental controls, intrusion detection, content filtering, hypervisors, and/or any number of other kernel tasking.

Kernel extensions and public APIs enable, for example, 3rd party software developers to develop a wide variety of applications that can interact with a computer system at even the lowest layers of abstraction. For example, kernel extensions can enable socket level filtering, IP level filtering, and even device interface filtering. In the current consumer applications space, many emerging technologies now rely on closely coupled interfaces to the hardware and kernel functionality. For example, many security applications “sniff” network traffic to detect malicious traffic or filter undesirable content; this requires access to other application sandboxes (a level of privilege that is normally reserved for the kernel).

Unfortunately, 3rd party kernel extensions can be dangerous and/or undesirable. As previously noted, software applications are restricted for security and stability reasons; however the kernel is largely unrestricted. A 3rd party kernel extension can introduce instability issues because the 3rd party kernel extensions run in the same address space as the kernel itself (which is outside the purview of traditional memory read/write protections based on memory allocations). Illegal memory accesses can result in segmentation faults and memory corruptions. Furthermore, unsecure kernel extensions can create security vulnerabilities that can be exploited by malware. Additionally, even where correctly used, a kernel extension can expose a user's data to the 3rd party software developer. This heightened level of access may raise privacy concerns (e.g., the 3rd party developer may have access to browsing habits, etc.).

Existing Performance Optimization Technologies—

FIG. 2 illustrates one logical representation of a computer system that implements Input/Output (I/O) network control, useful for explaining various aspects of traditional network optimization. As depicted therein, a software application 202 executing from user space opens multiple sockets 204 to communicate with e.g., a web server. Each of the sockets interfaces with a Data Link Interface Layer (DLIL) 206.

The DLIL 206 provides a common interface layer to each of the various physical device drivers which will handle the subsequent data transfer (e.g., Ethernet, Wi-Fi, cellular, etc.). The DLIL performs a number of system-wide holistic network traffic management functions. In one such implementation, the DLIL is responsible for BSD Virtual Interfaces, IOKit Interfaces (e.g., DLIL is the entity by which IOKit based network drivers are connected to the networking stack), Active Queue Management (AQM), flow control and advisory action, etc. In most cases, the device driver 208 may be handled by an external device (e.g., a baseband co-processor), thus the DLIL 206 is usually (but not always) the lowest layer of the network communication stack.

During normal operation, the computer system will logically segment its tasks to optimize overall system operation. In particular, a processor will execute a task, and then “context switch” to another task, thereby ensuring that any single process thread does not monopolize processor resources from start to finish. More directly, a context switch is the process of storing the state of a process, or of a thread, so that it can be restored and execution resumed from the same point later. This allows multiple processes to share a single processor. However, excessive amounts of context switching can slow processor performance down. Notably, while the present discussion is primarily discussed within the context of a single processor for ease of understanding, multi-processor systems have analogous concepts (e.g., multiple processors also perform context switching, although contexts may not necessarily be resumed by the same processor).

For example, consider the following example of a packet reception. Packets arrive at the device driver 208A. The hardware managed by the device driver 208A may notify the processor via e.g., a doorbell signal (e.g., an interrupt). The device driver 208A work loop thread handles the hardware interrupt/doorbell, then signals the DLIL thread (Loop 1 210). The processor services the device driver 208A with high priority, thereby ensuring that the device driver 208A operation is not bottlenecked (e.g., that the data does not overflow the device driver's memory and/or that the device driver does not stall). Once the data has been moved out of the device driver, the processor can context switch to other tasks.

At a later point, the processor can pick up the DLIL 206 execution process again. The processor determines which socket the packets should be routed to (e.g., socket 204A) and routes the packet data appropriately (Loop 2 212). During this loop, the DLIL thread takes each packet, and moves each one sequentially into the socket memory space. Again, the processor can context switch to other tasks so as to ensure that the DLIL task does not block other concurrently executed processing.

Subsequently thereafter, when the socket has the complete packet data transfer the processor can wake the user space application and deliver the packet into user space memory (Loop 3 214). Generally, user space applications are treated at lower priority than kernel tasks; this can be reflected by larger time intervals between suspension and resumption. While the foregoing discussion is presented in the context of packet reception, artisans of ordinary skill in the related arts will readily appreciate, given the contents of the present disclosure, that the process is substantially reversed for packet transmission.

As demonstrated in the foregoing example, context switching ensures that tasks of different processing priority are allocated commensurate amounts of processing time. For example, a processor can spend significantly more time executing tasks of relatively high priority, and service lower priority tasks on an as-needed basis. As a brief aside, human perception is much more forgiving than hardware operation. Consequently, kernel tasks are generally performed at a much higher priority than user space applications. The difference in priority between kernel and user space allows the kernel to handle immediate system management (e.g., hardware interrupts, and queue overflow) in a timely manner, with minimal noticeable impact to the user experience.

Moreover, FIG. 2 is substantially representative of every implementation of the traditional network communications stack. While implementations may vary from this illustrative example, virtually all networking stacks share substantially the same delivery mechanism. The traditional network communications stack schema (such as the BSD architecture and derivatives therefrom) has been very popular for the past 30 years due to its relative stability of implementation and versatility across many different device platforms. For example, the Assignee hereof has developed and implemented the same networking stack across virtually all of its products (e.g., MacBook®, iMac®, iPad®, iPhone®, Apple Watch®, etc.).

Unfortunately, changing tastes in consumer expectations cannot be effectively addressed with the one-size-fits-all model and the conservative in-kernel traditional networking stack. Artisans of ordinary skill in the related arts will readily appreciate, given the contents of the present disclosure, that different device platforms have different capabilities; for example, a desktop processor has significantly more processing and memory capability than a mobile phone processor. More directly, the “one-size-fits-all” solution does not account for the underlying platform capabilities and/or application requirements, and thus is not optimized for performance. Fine-tuning the traditional networking stack for performance based on various “tailored” special cases results in an inordinate amount of software complexity which is untenable to support across the entire ecosystem of devices.

Emerging Use Cases

FIG. 3 illustrates a logical block diagram of one exemplary implementation of Transport Layer Security (TLS) (the successor to Secure Sockets Layer (SSL)), useful to explain user/kernel space integration complexities of emerging use cases.

As shown, an application executing from user space can open a Hypertext Transfer Protocol (HTTP) session 302 with a TLS security layer 304 in order to securely transfer data (Application Transport Security (ATS) services) over a network socket 306 that offers TCP/IP transport 308, 310.

As a brief aside, TLS is a record based protocol; in other words, TLS uses data records which are arbitrarily sized (e.g., up to 16 kilobytes). In contrast, TCP is a byte stream protocol (i.e., a byte has a fixed length of eight (8) bits). Consequently, the TCP layer subdivides TLS records into a sequentially ordered set of bytes for delivery. The receiver of the TCP byte stream reconstructs TLS records from the TCP byte stream by receiving each TCP packet, re-ordering the packets according to sequential numbering to recreate the byte stream, and extracting the TLS record from the aggregated byte stream. Notably, every TCP packet of the sequence must be present before the TLS record can be reconstructed. Even though TCP can provide reliable delivery under lossy network conditions, there are a number of situations where TLS record delivery could fail. For example, under ideal conditions TCP isolates packet loss from its client (TLS in this example), and a single TCP packet loss should not result in failed TLS record delivery. However, the TLS layer or the application above may incorporate a timeout strategy in a manner that is unaware of the underlying TCP conditions. Thus, if there is significant packet loss in the network, the TLS timeout may be hit (and thus result in a failure to the application) even though TCP would normally provide reliable delivery.
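
By way of illustration, the following minimal C sketch shows how a receiver might reassemble a TLS record from an in-order TCP byte stream. The 5-byte record header layout (type, version, 16-bit big-endian length) follows the TLS specification; the structure and function names are hypothetical and not part of the disclosed architecture.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define TLS_HEADER_LEN 5
    #define TLS_MAX_RECORD (16 * 1024)

    struct tls_reassembly {
        uint8_t buf[TLS_HEADER_LEN + TLS_MAX_RECORD];
        size_t  filled;                        /* bytes accumulated so far */
    };

    /* Feed sequentially ordered TCP payload bytes. Returns 1 when a full
     * record is available in r->buf, 0 if more bytes are needed, and -1
     * if the stream would overflow the record buffer. */
    static int tls_feed(struct tls_reassembly *r, const uint8_t *p, size_t len)
    {
        if (r->filled + len > sizeof(r->buf))
            return -1;
        memcpy(r->buf + r->filled, p, len);
        r->filled += len;

        if (r->filled < TLS_HEADER_LEN)
            return 0;                          /* header not yet complete */

        /* Record length is a 16-bit big-endian field at offset 3. */
        size_t want = TLS_HEADER_LEN + (((size_t)r->buf[3] << 8) | r->buf[4]);
        return r->filled >= want ? 1 : 0;
    }

As the sketch makes plain, a single missing TCP segment stalls delivery of the entire record, which is the failure mode described above.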

Referring back to FIG. 3, virtually every modern operating system executes TLS from user space when e.g., securely connecting to other network entities, inter alia, a web browser instance and a server. But existing implementations of TLS are not executed from the kernel (or other privileged software layer) due to e.g., the complexity of error handling within the kernel. However, as a practical matter, TLS would operate significantly better with information regarding the current networking conditions (held in the kernel).

Ideally, the TLS layer should set TLS record sizes based on network condition information. In particular, large TLS records can efficiently use network bandwidth, but require many successful TCP packet deliveries. In contrast, small TLS records incur significantly more network overhead, but can survive poor bandwidth conditions. Unfortunately, networking condition information is lower layer information that is available to the kernel space (e.g., the DLIL and device drivers), but generally restricted from user space applications. Some 3rd party application developers and device manufacturers have incorporated kernel extensions (or similar operating system capabilities) to provide network condition information to the TLS user space applications; however, kernel extensions are undesirable due to the aforementioned security and privacy concerns. Alternately, some 3rd party applications infer the presence of lossy network conditions based on historic TLS record loss. Such inferences are an indirect measure; they are significantly less accurate and lag behind real-time information (i.e., previous packet loss often does not predict future packet loss).
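
The following C sketch illustrates one such sizing policy; the loss-rate thresholds and record sizes are purely hypothetical assumptions, since the disclosure only requires that lower-layer condition information reach the TLS layer, not any particular sizing rule.

    #include <stddef.h>

    /* Shrink TLS records as observed loss grows: large records amortize
     * header overhead on clean links, while small records limit the
     * number of TCP segments that must all arrive for one record. */
    static size_t tls_pick_record_size(double observed_loss_rate)
    {
        if (observed_loss_rate < 0.001)
            return 16 * 1024;  /* clean link: maximize bandwidth efficiency */
        if (observed_loss_rate < 0.01)
            return 4 * 1024;   /* moderate loss: fewer segments per record */
        return 1400;           /* heavy loss: roughly one segment per record */
    }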

FIG. 4 illustrates a logical block diagram of an exemplary implementation of a Virtual Private Network (VPN), useful to explain recursive/cross-layer protocol layer complexities of emerging use cases.

As shown, an application executing from user space can open a Virtual Private Network (VPN) session 402 over a network socket 406 that offers TCP/IP transport 408, 410. The VPN session is secured with Encapsulating Security Protocol (ESP) 412. The encrypted packet is securely tunneled via TLS 404 (in user space) and recursively sent again over TCP/IP transport 408, 410.

As illustrated within FIG. 4, the exemplary VPN tunnel starts in user space, crosses into kernel space, returns back to user space, and then crosses back into kernel space before being transferred. Each of the domain crossings results in costly context switches and data shuffling, both of which are processor intensive and inefficient. More directly, every time data traverses from user space to kernel space, the data must be validated (which takes non-trivial processing time). Additionally, context switching can introduce significant latency while the task is suspended.

Artisans of ordinary skill in the related arts, given the contents of the present disclosure, will readily appreciate that the exemplary recursive cross layer transaction of FIG. 4 is merely illustrative of a broad range of applications which use increasingly exotic protocol layer compositions. For example, applications that traverse the application proxy/agent data path commonly require tunneling TCP (kernel space) over application proxy/agent data path (user space) over UDP/IP (kernel space). Another common implementation is IP (kernel space) over Quick UDP Internet Connections (QUIC) (user space) over UDP/IP (kernel space).

FIG. 5 illustrates a logical block diagram of an exemplary implementation of application based tuning, useful to explain various other workload optimization complexities of emerging use cases.

As shown, three (3) different concurrently executed applications (e.g., a real time application 502, interactive application 504, and file transfer applications 506) in user space, each open a session over network sockets 508 (508A, 508B, 508C) that offer TCP/UDP/IP transport 510/512. Depending on the type of physical interface required, the sessions are switched to BSD network interfaces (ifnet) 514 (514A, 514B, 514C) which handle the appropriate technology. Three different illustrated technology drivers are shown: Wi-Fi 516, Bluetooth 518, and cellular 520.

It is well understood within the networking arts that different application types are associated with different capabilities and requirements. One such example is real time applications 502, commonly used for e.g., streaming audio/visual and/or other “live” data. Real time data has significant latency and/or throughput restrictions; moreover, certain real time applications may not require (and/or support) retransmission for reliable delivery of lost or corrupted data. Instead, real time applications may lower bandwidth requirements to compensate for poor transmission quality (resulting in lower quality, but timely, delivered data).

Another such example is interactive applications 504, commonly used for e.g., human input/output. Interactive data should be delivered at latencies that are below the human perceptible threshold (within several milliseconds) to ensure that the human experience is relatively seamless. This latency interval may be long enough for a retransmission, depending on the underlying physical technology. Additionally, human perception can be more or less tolerant of certain types of data corruptions; for example, audio delays below 20 ms are generally imperceptible, whereas audio corruptions (pops and clicks) are noticeable. Consequently, some interactive applications may allow for some level of error correction and/or adopt less aggressive bandwidth management mechanisms depending on the acceptable performance requirements for human perception.

In contrast to real time applications and interactive applications, file transfer applications 506 require perfect data fidelity without latency restrictions. To these ends, most file transfer technologies support retransmission of lost or corrupted data, and retransmission can have relatively long attempt intervals (e.g., on the order of multiple seconds to a minute).

Similarly, within the communication arts, different communication technologies are associated with different capabilities and requirements. For example, Wi-Fi 516 (wireless local area networking based on IEEE 802.11) is heavily based on contention based access and is best suited for high bandwidth deliveries with reasonable latency. Wi-Fi is commonly used for file transfer type applications. Bluetooth 518 (personal area networking) is commonly used for low data rate and low latency applications. Bluetooth is commonly used for human interface devices (e.g., headphones, keyboards, and mice). Cellular network technologies 520 often provide non-contention based access (e.g., dedicated user access) and can be used over varying geographic ranges. Cellular voice or video delivery is a good example of streaming data applications. Artisans of ordinary skill in the related arts will readily recognize that the foregoing examples are purely illustrative, and that different communication technologies are often used to support a variety of different types of application data. For example, Wi-Fi 516 can support file transfer, real time data transmission and/or interactive data with equivalent success.

Referring back to FIG. 5, the presence of multiple concurrently executing applications of FIG. 5 (real time application 502, interactive application 504, and file transfer applications 506) illustrates the complexities of multi-threaded operation. As shown therein, the exemplary multi-threaded operation incurs a number of server loops. Each server loop represents a logical break in the process during which the processor can context switch (see also aforementioned discussion of Existing Performance Optimization Technologies, and corresponding FIG. 2).

Moreover, in the computing arts, a “locking” synchronization mechanism is used by the kernel to enforce access limits (e.g., mutual exclusion) on resources in multi-threaded execution. During operation, each thread acquires a lock before accessing the corresponding locked resource's data. In other words, at any point in time, the processor is necessarily limited to only the resources available to its currently executing process thread.
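
The acquire/access/release discipline can be illustrated with the following minimal POSIX threads sketch; the shared counter is hypothetical, and in-kernel locking primitives differ in detail, but the pattern is the same.

    #include <pthread.h>

    static pthread_mutex_t resource_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long shared_packet_count;   /* the "locked" resource */

    void account_packet(void)
    {
        pthread_mutex_lock(&resource_lock);     /* acquire before access */
        shared_packet_count++;                  /* exclusive access */
        pthread_mutex_unlock(&resource_lock);   /* release for other threads */
    }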

Unfortunately, each of the applications has different latency, throughput and processing utilization requirements, since each of the network interfaces is sending and receiving data at different times, in different amounts, and with different levels of priority. From a purely logistical standpoint, the kernel is constantly juggling between high priority kernel threads (to ensure that the high priority hardware activities do not stall out) while still servicing each of its concurrently running applications to attempt to provide acceptable levels of service. In some cases, however, the kernel is bottlenecked by the processor's capabilities. Under such situations, some threads will be deprioritized; currently, the traditional networking stack architecture is unable to clearly identify which threads can be deprioritized while still providing acceptable user service.

For example, consider an “expected use” device of FIG. 5; the processor is designed for the expected use case of providing streaming video. Designing for expected use cases allows the device manufacturer to use less capable, but adequate components thereby reducing bill of materials (BOM) costs and/or offering features at a reasonable price point for consumers. In this case, a processor is selected that nominally meets the requirements for a streaming video application that is receiving streaming video data via one of the network interfaces (e.g., the Wi-Fi interface), and constantly servicing the kernel threads associated with it. Rendering the video with a real time application 502 from the received data is a user space application that is executed concurrently but at a significantly lower priority. During expected usage, the video rendering is adequate.

Unfortunately, the addition of an unexpected amount of additional secondary interactive applications 504 (e.g., remote control interface, headphones, and/or other interface devices) and/or background file transfer applications can easily overwhelm the processor. Specifically, the primary real time application does not get enough CPU cycles to run within its time budget, because the kernel threads handling networking are selected at a higher priority. In other words, the user space application is not able to depress the priority of kernel networking threads (which are servicing both the primary and secondary processes). This can result in significantly worse user experience when the video rendering stalls out (video frame misses or video frame drops); whereas simply slowing down a file transfer or degrading the interaction interface may have been preferable.

Prior art solutions have tailored software for specific device implementations (e.g., the Apple TV®). For example, the device can be specifically programmed for an expected use. However, tailored solutions are becoming increasingly common and by extension the exceptions have swallowed the more generic use case. Moreover, tailored solutions are undesirable from multiple software maintenance standpoints. Devices have limited productive lifetimes, and software upkeep is non-trivial.

Ideally, a per-application or per-profile workload optimization would enable a single processor (or multiple processors) to intelligently determine when and/or how to context switch and/or prioritize its application load (e.g., in the example of FIG. 5, to prioritize video decode). Unfortunately, such solutions are not feasible within the context of the existing generic network sockets and generic network interfaces to a monolithic communications stack.

Exemplary Networking Architecture—

A networking stack architecture and technology that caters to the needs of non-kernel based networking use cases is disclosed herein. Unlike prior art monolithic networking stacks, the exemplary networking stack architecture described hereinafter includes various components that span multiple domains (both in-kernel, and non-kernel), with varying transport compositions, workload characteristics and parameters.

In one exemplary embodiment, a networking stack architecture is disclosed that provides an efficient infrastructure to transfer data across domains (user space, non-kernel, and kernel). Unlike the traditional networking paradigm that hides the underlying networking tasks within the kernel and substantially limits control thereof by any non-kernel applications, the various embodiments described herein enable faster and more efficient cross domain data transfers.

Various embodiments of the present disclosure provide a faster and more efficient packet input/output (I/O) infrastructure than prior art techniques. Specifically, unlike traditional networking stacks that use a “socket” based communication, disclosed embodiments can transfer data directly between the kernel and user space domains. Direct transfer reduces the per-byte and per-packet costs relative to socket based communication. Additionally, direct transfer can improve observability and accountability with traffic monitoring.

In one such variant, a simplified data movement model that does not require mbufs (memory buffers) is described in greater detail herein. During one such exemplary operation, the non-kernel processes can efficiently transfer packets directly to and from the in-kernel drivers.

In another embodiment, a networking stack architecture is disclosed that exposes the networking protocol stack infrastructure to user space applications via network extensions. In one such embodiment, the network extensions are software agents that enable extensible, cross-platform-capable, user space control of the networking protocol stack functionality. In another such embodiment, an in-process user space networking stack facilitates tighter integration between the protocol layers (including TLS) and the application or daemon. In some cases, the user space architecture can expose low-level networking interfaces to transport protocols and/or encapsulation protocols such as UDP, TCP, and QUIC; and enable network protocol extensions and rapid development cycles. Moreover, artisans of ordinary skill in the related arts, given the contents of the present disclosure, will readily appreciate that the various principles described herein may be applied to a variety of other operating systems (such as Windows, Linux, Unix, Android), and/or other cross platform implementations.

In some variants, exemplary embodiments of the networking stack can support multiple system-wide networking protocol stack instances (including an in-kernel traditional network stack). Specifically, in one such variant, the exemplary networking stack architecture coexists with the traditional in-kernel networking stack so as to preserve backwards compatibility for legacy networking applications. In such implementations, the in-kernel network stack instance can coexist with the non-kernel network stack via namespace sharing and flow forwarding.

As used herein, an “instance” may refer to a single copy of a software program or other software object; “instancing” and “instantiations” refer to the creation of the instance. Multiple instances of a program can be created; e.g., copied into memory several times. Software object instances are instantiations of a class; for example, a first software agent and a second software agent are each distinct instances of the software agent class.

In one such implementation, load balancing for multiple networking stacks is handled within the kernel, thereby ensuring that no single networking stack (including the in-kernel stack) monopolizes system resources.

As a related variant, current/legacy applications can be handled within the in-kernel stack. More directly, by supporting a separate independent in-kernel BSD stack, legacy applications can continue to work without regressions in functionality and performance.

FIG. 6 illustrates one logical representation of an exemplary networking stack architecture, in accordance with the various aspects of the present disclosure. While the system depicts a plurality of user space applications 602 and/or legacy applications 612, artisans of ordinary skill will readily appreciate, given the contents of the present disclosure, that the disclosed embodiments may be used within single application systems with equivalent success.

As shown, a user space application 602 can initiate a network connection by instancing user space protocol stacks 604. Each user space protocol stack includes network extensions for e.g., TCP/UDP/QUIC/IP, cryptography, framing, multiplexing, tunneling, and/or any number of other networking stack functionalities. Each user space protocol stack 604 communicates with one or more nexuses 608 via a channel input/output (I/O) 606. Each nexus 608 manages access to the network drivers 610. Additionally shown is legacy application 612 support via existing network socket technologies 614. While the illustrated embodiment shows nexus connections to both user space and in-kernel networking stacks, it is appreciated that the nexus may also enable e.g., non-kernel networking stacks (such as may be used by a daemon or other non-kernel, non-user process).

The following topical sections hereinafter describe the salient features of the various logical constructs in greater detail.

Exemplary I/O Infrastructure

In one exemplary embodiment, the non-kernel networking stack provides a direct channel input output (I/O) 606. In one such implementation, the channel I/O 606 is included as part of the user space protocol stack 604. More directly, the channel I/O 606 enables the delivery of packets as a raw data I/O into kernel space with a single validation (e.g., only when the user stack provides the data to the one or more nexuses 608). The data can be directly accessed and/or manipulated in situ; the data need not be copied to an intermediary buffer.

In one exemplary implementation, a channel is an I/O scheme leveraging kernel-managed shared memory. During an access, the channel I/O is presented to the process (e.g., the user process or kernel process) as a file descriptor based object, rather than as data. In order to access the data, the process de-references the file descriptor for direct access to the shared memory within kernel space. In one such implementation, the file descriptor based I/O is compatible with existing operating system signaling and “eventing” (event notification/response) mechanisms. In one exemplary variant, the channel I/O is based on Inter Process Communication (IPC) packets.
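
For illustration, the following C sketch shows the general file-descriptor-to-memory pattern using standard POSIX calls; the "/dev/channel0" node and the function name are hypothetical assumptions, as the disclosed channel object is not necessarily a plain device file.

    #include <stddef.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *channel_map(size_t ring_bytes)
    {
        int fd = open("/dev/channel0", O_RDWR);   /* hypothetical node */
        if (fd < 0)
            return NULL;

        /* "De-reference" the descriptor: map the kernel-managed region
         * directly into this process's address space. */
        void *ring = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        close(fd);                                /* mapping persists */
        return (ring == MAP_FAILED) ? NULL : ring;
    }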

As used herein, the term “descriptor” may refer to data structures that indicate how other data is stored. Descriptors generally include multiple parameters and can be used to identify more complex data structures; for example, a descriptor may include one or more of type, size, address, tag, flag, headers, footers, metadata, structural links to other data descriptors or locations, and/or any other number of format or construction information.

Within the context of the present disclosure, as used herein, the term “pointer” may refer to a specific reference data type that “points” or “references” a location of data in memory. Typically, a pointer stores a memory address that is interpreted by a compiler as an absolute location in system memory or a relative location in system memory based on e.g., a base address, reference address, memory window, or other memory subset. During operation, a pointer is “de-referenced” to recover the data that is stored in the location of memory.

As used herein, the term “metadata” refers to data that describes data. Metadata varies widely in application, but generally falls into one of the descriptive, structural, and/or administrative categories. Descriptive metadata describes data in a manner to enable e.g., discovery and/or identification. Common examples include without limitation e.g., type, size, index tags, and keywords. Structural metadata describes the structure of the data e.g., how compound objects are put together. Common examples include without limitation e.g., prefix, postfix, table of contents, order, and/or any other information that describes the relationships and other characteristics of digital materials. Administrative metadata provides information to help manage a resource; common examples include e.g., authorship and creation information, access privileges, and/or error checking and security based information (e.g., cyclic redundancy checks (CRC), parity, etc.).

In one exemplary embodiment, the channel I/O can be further leveraged to provide direct monitoring of its associated memory. More directly, unlike existing data transfers which are based on mbuf based divide/copy/move, etc., the channel I/O can provide (with appropriate viewing privileges) a direct window into the memory accesses of the system. Such implementations further simplify software development as debugging and/or traffic monitoring can be performed directly on traffic. Direct traffic monitoring can reduce errors attributed to false positives/false negatives caused by e.g., different software versioning, task scheduling, compiler settings, and/or other software introduced inaccuracies.

More generally, unlike prior art solutions which relied on specialized networking stack compositions to provide different degrees of visibility at different layers, the monitoring schemes of the present disclosure provide consistent system-wide channel monitoring infrastructures. Consistent frameworks for visibility, accounting, and debugging greatly improve software maintenance and upkeep costs.

Additionally, simplified schemes for egress filtering can be used to prevent traffic spoofing for user space networking stack instances. For example, various embodiments ensure that traffic of an application cannot be hijacked by another malicious application (by the latter claiming to use the same tuple information, e.g., TCP/UDP port).

In one exemplary embodiment, the in-kernel network device drivers (e.g., Wi-Fi, Cellular, Ethernet) use simplified data movement models based on the aforementioned channel I/O scheme. More directly, the user space networking stacks can directly interface to each of the various different technology based network drivers via channel I/O; in this manner, the user space networking stacks do not incur the traditional data mbuf based divide/copy/move penalties. Additionally, user space applications can directly access user space networking components for immediate traffic handling and processing.

Exemplary Nexus—

In one exemplary embodiment, the networking stack connects to one or more nexuses 608. In one such implementation, the nexus 608 is a kernel space process that arbitrates access to system resources including, without limitation e.g., shared memory within kernel space, network drivers, and/or other kernel or user processes. In one such variant, the nexus 608 aggregates one or more channels 606 together for access to the network drivers 610 and/or shared kernel space memory.

In one exemplary implementation, a nexus is a kernel process that determines the format and/or parameters of the data flowing through its connected channels. In some variants, the nexus may further perform ingress and/or egress filtering.

The nexus may use the determined format and/or parameter information to facilitate one-to-one and one-to-many topologies. For example, the nexus can create user-pipes for process-to-process channels; kernel-pipes for process-to-kernel channels; network interfaces for direct channel connection from a process to in-kernel network drivers, or legacy networking stack interfaces; and/or flow-switches for multiplexing flows across channels (e.g., switching a flow from one channel to one or more other channels).

Additionally, in some variants the nexus may provide the format, parameter, and/or ingress/egress information to kernel processes and/or one or more appropriately privileged user space processes.

In one exemplary embodiment, the nexus 608 may additionally ensure that there is fairness and/or appropriately prioritize each of its connected stacks. For example, within the context of FIG. 6, the nexus 608 balances the network priorities of both the existing user space application networking stacks 604, as well as providing fair access for legacy socket based access 614. For example, as previously alluded to, existing networking stacks could starve user space applications because the kernel threads handling the legacy networking stack operated at higher priorities than user space applications. However, the exemplary nexus 608 ensures that legacy applications do not monopolize system resources by appropriately servicing the user space network stacks as well as the legacy network stack.

In one such embodiment, in-kernel, non-kernel, and/or user space infrastructures ensure fairness and can reduce latency due to e.g., buffer bloat (across channels in a given nexus, as well as flows within a channel). In other words, the in-kernel and/or user space infrastructures can negotiate proper buffering sizes based on the expected amount of traffic and/or network capabilities for each flow. By buffering data according to traffic and/or network capability, buffers are not undersized or oversized.

As a brief aside, “buffer bloat” is commonly used to describe e.g., high latency caused by excessive buffering of packets. Specifically, buffer bloat may occur when excessively large buffers are used to support a real time streaming application. Notably, the TCP retransmission mechanism relies on measuring the occurrence of packet drops to determine the available bandwidth. Under certain congestion conditions, excessively large buffers can prevent the TCP feedback mechanism from correctly inferring the presence of a network congestion event in a timely manner (the buffered packets “hide” the congestion, since they are not dropped). Consequently, the buffers have to drain before TCP congestion control resets and the TCP connection can correct itself.

Referring back to FIG. 6, in one exemplary embodiment, Active Queue Management (AQM) can be implemented in the kernel across one or more (potentially all) of the flow-switch clients (user space and in-kernel networking stack instances). AQM refers to the intelligent culling of network packets associated with a network interface, to reduce network congestion. By dropping packets before the queue is full, the AQM ensures no single buffer approaches its maximum size, and TCP feedback mechanisms remain timely (thereby avoiding the aforementioned buffer bloat issues).
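
One well-known AQM policy is Random Early Detection (RED), sketched below in C for illustration only; the disclosure does not mandate RED, and the queue thresholds are hypothetical. Drop probability ramps up as average queue occupancy grows, so senders observe drops (and back off) before any buffer fills.

    #include <stdbool.h>
    #include <stdlib.h>

    #define QUEUE_MIN_THRESH  32   /* packets: below this, never drop  */
    #define QUEUE_MAX_THRESH 128   /* packets: above this, always drop */

    bool aqm_should_drop(unsigned avg_queue_depth)
    {
        if (avg_queue_depth < QUEUE_MIN_THRESH)
            return false;
        if (avg_queue_depth >= QUEUE_MAX_THRESH)
            return true;

        /* Drop probability rises linearly between the two thresholds. */
        unsigned span   = QUEUE_MAX_THRESH - QUEUE_MIN_THRESH;
        unsigned excess = avg_queue_depth - QUEUE_MIN_THRESH;
        return ((unsigned)rand() % span) < excess;
    }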

While the foregoing example is based on a “fairness” standard, artisans of ordinary skill in the related arts will readily appreciate that other schemes may be substituted with equivalent success given the contents of the present disclosure. For example, some embodiments may dynamically or statically service the user application networking space with greater or lesser weight compared to the legacy socket based access. For example, user application networking space may be more heavily weighted to improve overall performance or functionality, whereas legacy socket based access may be preferred where legacy applications are preferentially supported (e.g., see Protocol Onloading and Offloading, discussed infra).

Exemplary Network Extensions

In one exemplary embodiment of the present disclosure, a network extension is disclosed. A network extension is an agent-based extension that is tightly coupled to network control policies. The agent is executed by the kernel and exposes libraries of network control functionality to user space applications. During operation, user space software can access kernel space functionality through the context and privileges of the agent.

As used herein, the term “agent” may refer to a software agent that acts for a user space application or other program in a relationship of agency with appropriate privileges. The agency relationship between the agent and the user space application implies the authority to decide which, if any, action is appropriate given the user application and kernel privileges. A software agent is privileged to negotiate with the kernel and other software agents regarding without limitation e.g., scheduling, priority, collaboration, visibility, and/or other sharing of user space and kernel space information. While the agent negotiates with the kernel on behalf of the application, the kernel ultimately decides on scheduling, priority, etc.

Various benefits and efficiencies can be gained through the use of network extensions. In particular, user space applications can control the protocol stack down to the resolution of exposed threads (i.e., the threads that are made available by the agent). In other words, software agents expose specific access to lower layer network functionality which was previously hidden or abstracted away from user space applications. For example, consider the previous examples of TLS record sizing (see e.g., FIG. 3, and related discussion); by exposing TCP network conditions to the TLS application within the user space, the TLS application can correctly size records for network congestion and/or wait for underlying TCP retransmissions (rather than timing out).

Similarly, consider the previous examples of multi-threading within the context of expected use devices (see e.g., FIG. 5, and related discussion); the primary user space application (e.g., video coding) and additional secondary interactive applications (e.g., remote control interface, headphones, and/or other interface devices) can internally negotiate their relative priority to the user's experience. The user space applications can appropriately adjust their priorities for the nexus (i.e., which networking threads are serviced first and/or should be deprioritized). Consequently, the user space applications can deprioritize non-essential network accesses, thereby preserving enough CPU cycles for video decode.

As a related benefit, since a software agent represents the application to the kernel, the agent can trust the kernel, but the kernel may or may not trust the agent. For example, a software agent can be used by the kernel to convey network congestion information in a trusted manner to the application; similarly, a software agent can be used by an application to request a higher network priority. Notably, since a software agent operates from user space, the agent's privilege is not promoted to kernel level permissions. In other words, the agent does not permit the user application to exceed its privileges (e.g., the agent cannot commandeer the network driver at the highest network priority, or force a read/write to another application's memory space without the other kernel and/or other application's consent).

Networking extensions allow the user space application to execute networking communications functionality within the user space and interpose a network extension between the user space application and the kernel space. As a result, the number of cross domain accesses for complex layering of different protocol stacks can be greatly reduced. Limiting cross domain accesses prevents context switching and allows the user space to efficiently police its own priorities. For example, consider the previous example of a VPN session as was previously illustrated in FIG. 4. By keeping the TCP/IP, Internet Protocol Security (IPsec) and TLS operations within user space, the entire tunnel can be performed within the user space, and only cross the user/kernel domain once.

As used herein, the term “interposition” may refer to the insertion of an entity between two or more layers. For example, an agent is interposed between the application and the user space networking stack. Depending on the type of agent or network extension, the interposition can be explicit or implicit. Explicit interposition occurs where the application explicitly instances the agent or network extension. For example, the application may explicitly call a user space tunnel extension. In contrast, implicit interposition occurs where the application did not explicitly instance the agent or network extension. Common examples of implicit interposition occur where one user space application sniffs the traffic or filters the content of another user space application.

Namespace Sharing & Flow Forwarding Optimizations

In one exemplary optimization of the present disclosure, the nexus includes a namespace registration and management component that manages a common namespace for all of its connected networking stack instances. As a brief aside, a namespace generally refers to a set of unique identifiers (e.g., the names of types, functions, variables) within a common context. Namespaces are used to prevent naming “collisions” which occur where multiple processes call the same resource differently and/or call different resources the same.

In one such implementation, the shared networking protocol has a common namespace (e.g., {Address, Protocol, and Port}) across multiple networking stack instances. Sharing a namespace between different networking stacks reduces the amount of kernel burden, as the kernel can natively translate (rather than additionally adding a layer of network address translation).

For example, if a first application acquires port 80, the namespace registration ensures that other applications will not use port 80 (e.g., they can be assigned e.g., port 81, 82, etc.). In some such implementations, legacy clients may use default namespaces that conflict (e.g., a default web client may always select port 80); thus the shared namespace registration may also be required to force a re-assignment of a new identifier for (or else translate on behalf of) such legacy applications.

In one exemplary embodiment, the namespace registration and management components control flow-switching and forwarding logic of each flow-switch nexus instance. For example, as previously noted, the nexus can create user-pipes for process-to-process channels; kernel-pipes for process-to-kernel channels; network interfaces for direct channel connection from a process to in-kernel network drivers, or legacy networking stack interfaces; and/or flow-switches for multiplexing flows across channels (e.g., switching a flow from one channel to one or more other channels).

For example, during normal operation when an application requests a port, the namespace registration and management will create a flow and assign a particular port to the application. Subsequent packets addressed to the port will be routed appropriately to the flow's corresponding application. In one such variant, packets that do not match any registered port within the shared namespace registration and management will default to the legacy networking stack (e.g., the flow-switch assumes that the unrecognized packet can be parsed and/or ignored by the fallback legacy stack).
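
The registration and lookup behavior can be sketched as follows in C; the fixed-size linear table, sentinel value, and function names are hypothetical simplifications (a production flow-switch would key a hash table on the full {address, protocol, port} tuple).

    #include <stdint.h>

    #define MAX_FLOWS       256
    #define LEGACY_STACK_ID 0     /* sentinel: deliver to the BSD stack */

    struct flow_entry { uint8_t proto; uint16_t port; int owner_stack_id; };
    static struct flow_entry flow_table[MAX_FLOWS];
    static int flow_count;

    /* Returns 0 on success, -1 if the tuple is already claimed (the
     * caller must then pick or be assigned another port). */
    int namespace_register(uint8_t proto, uint16_t port, int stack_id)
    {
        if (flow_count >= MAX_FLOWS)
            return -1;
        for (int i = 0; i < flow_count; i++)
            if (flow_table[i].proto == proto && flow_table[i].port == port)
                return -1;
        flow_table[flow_count++] =
            (struct flow_entry){ proto, port, stack_id };
        return 0;
    }

    /* Packets that match no registered flow default to the legacy stack. */
    int flow_lookup(uint8_t proto, uint16_t port)
    {
        for (int i = 0; i < flow_count; i++)
            if (flow_table[i].proto == proto && flow_table[i].port == port)
                return flow_table[i].owner_stack_id;
        return LEGACY_STACK_ID;
    }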

Artisans of ordinary skill in the related arts will readily appreciate, given the contents of the present disclosure, that disparate and/or otherwise distinct namespace registrations and/or management components may be preferable based on other implementation specific considerations. For example, some implementations may prefer to shield namespaces from other external processes e.g., for security and/or privacy considerations. In other implementations, the benefits associated with native namespace translation may be less important than supporting legacy namespaces.

Protocol Onloading and Offloading

In the foregoing discussions, the improvements to user space operation may be primarily due to the user space networking stack, as shown in FIG. 6. However, various embodiments of the present disclosure also leverage the existing legacy host networking infrastructure to handle networking transactions which are unrelated to user experience.

Colloquially, the term “hardware offload” may be commonly used to denote tasks which can be handled within dedicated hardware logic to improve overall processing speed or efficiency. One such example is the cyclic redundancy check (CRC) calculation, which is an easily parameterized, closed, iterative calculation. The characteristics of CRC calculation lend themselves to hardware offload because the CRC does not benefit from the flexibility of a general purpose processor, and CRC calculations are specialized functions that are not transferable to other processing operations.
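
A minimal bitwise CRC-32 (the IEEE 802.3 polynomial in reflected form) is sketched below in C to show why CRC suits hardware offload: it is a closed shift/XOR loop with no data-dependent control flow, which maps directly onto dedicated logic. Table-driven software variants and other polynomials exist; this form is chosen purely for clarity.

    #include <stdint.h>
    #include <stddef.h>

    uint32_t crc32_ieee(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int bit = 0; bit < 8; bit++)      /* one shift per bit */
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
        }
        return ~crc;
    }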

By analogous extension, as used herein, the term “protocol offload” may refer to processes that should be handled within the legacy networking stack because they are not specific to a user space application or task. In contrast, the term “protocol onload” may refer to processes that should be handled within a user space networking stack because they are specific to a user space application or task and benefit the overall performance. As a general qualitative criterion, tasks which are “fast” (e.g., generally UDP/TCP/IP based user space applications) are protocol onloaded to improve user performance; in contrast, “slow” tasks (e.g., ARP, IPv6 Neighbor Discovery, routing table updates, control path for managing interfaces, etc.) are protocol offloaded.

For example, consider Address Resolution Protocol (ARP) request handling; when an ARP request comes in, the host processor responds with a reply. However, the ARP request is non-specific to a user space application; rather the ARP reply concerns the holistic system. More generally, any networking process that is not specific to an application space can be implemented within the kernel under legacy techniques. Alternatively, any process that can be handled regardless of device state should remain with the kernel (e.g., the kernel persists across low power states, and is never killed).

By allowing the mature in-kernel networking stack to retain ownership of certain control logic (e.g., routing and policy table, interface configuration, address management), various embodiments of the present disclosure avoid “split-brain” behaviors. In other words, the kernel ensures that networking data and/or availability remains consistent regardless of the user space application availability.

Exemplary User Space Networking Stack

Referring now to FIG. 7, one logical block diagram of an exemplary user space networking stack 700 is depicted. As shown, the user space networking stack 700 includes an application interface 702, and an operating system interface 704. Additionally, the user space networking stack includes one or more user space instances of TLS 706, QUIC 708, TCP 710, UDP 712, IP 714, and ESP 716. The disclosed instances are purely illustrative; artisans of ordinary skill in the related arts will readily appreciate that any other user space kernel extension and/or socket functionality may be made available within the user space networking stack 700.

In one exemplary embodiment, the user space networking stack 700 is instantiated within an application user space 718. More directly, the user space networking stack 700 is treated identically to any one of multiple threads 720 within the application user space 718. Each of the coexisting threads 720 has access to the various functions and libraries offered by the user space networking stack via a direct function call.

As a brief aside, each of the threads 720 resides within the same address space. By virtue of their shared addressability, each of the threads may grant or deny access to their portions of shared address space via existing user space memory management schemes and/or virtual machine type protections. Additionally, threads can freely transfer data structures from one to the other, without e.g., incurring cross domain penalties. For example, TCP data 710 can be freely passed to TLS 706 as a data structure within a user space function call.

As previously noted, the user space networking stack 700 may grant or deny access to other coexistent user space threads; e.g., a user space thread is restricted to the specific function calls and privileges made available via the application interface 702. Furthermore, the user space networking stack 700 is further restricted to interfacing the operating system via the specific kernel function calls and privileges made available via the operating system interface 704. In this manner, both the threads and the user space networking stack have access and visibility into the kernel space, without compromising the kernel's security and stability.

One significant benefit of the user space networking stack 700 is that networking function calls can be made without acquiring various locks that are present in the in-kernel networking stack. As previously noted, the “locking” mechanism is used by the kernel to enforce access limits on multiple threads from multiple different user space applications; however, in the user space, access to shared resources is handled within the context of only one user application space at a time. Consequently, access to shared resources is inherently handled by the single threading nature of user space execution. More directly, only one thread can access the user space networking stack 700 at a time; consequently, kernel locking is entirely obviated by the user space networking stack.

Another benefit of user space based network stack operation is cross platform compatibility. For example, certain types of applications (e.g., iTunes®, Apple Music® developed by the Assignee hereof) are deployed over a variety of different operating systems. Similarly, some emerging transport protocols (e.g., QUIC) are ideally served by portable and common software between the client and server endpoints. Consistency in the user space software implementation allows for better and more consistent user experience, improves statistical data gathering and analysis, and provides a foundation for enhancing, experimenting and developing network technologies used across such services. In other words, a consistent user space networking stack can be deployed over any operating system platform without regard for the native operating system stack (e.g., which may vary widely).

Another important advantage of the exemplary user space networking stack is the flexibility to extend and improve the core protocol functionalities, and thus deliver specialized stacks based on the application's requirements. For example, a video conferencing application (e.g., FaceTime® developed by the Assignee hereof) may benefit from a networking stack catered to optimize performance for real-time voice and video-streaming traffic (e.g., by allocating more CPU cycles for video rendering, or conversely deprioritizing unimportant ancillary tasks). In one such variant, a specialized stack can be deployed entirely within the user space application, without specialized kernel extensions or changes to the kernel. In this manner, the specialized user space networking stack can be isolated from other networking stacks. This is important both from a reliability standpoint (e.g., updated software doesn't affect other software), as well as to minimize debugging and reduce development and test cycle times.

Furthermore, having the network transport layer (e.g., TCP, QUIC) reside in user space can open up many possibilities for improving performance. For example, as previously alluded to, applications (such as TLS) can be modified depending on the underlying network connections. User space applications can be collapsed or tightly integrated into network transports. In some variants, data structure sizes can be adjusted based on immediate lower layer network condition information (e.g., to accommodate or compensate for poor network conditions). Similarly, overly conservative or under conservative transport mechanisms can be avoided (e.g., too much or not enough buffering previously present at the socket layer). Furthermore, unnecessary data copies and/or transforms can be eliminated and protocol signaling (congestion, error, etc.) can be delivered more efficiently.

In yet another embodiment, the exemplary user space networking stack further provides a framework for both networking clients and networking providers. In one such variant, the networking client framework allows the client to interoperate with any network provider (including the legacy BSD stack). In one such variant, the network provider framework provides consistent methods of discovery, connection, and data transfer to networking clients. By providing consistent frameworks for clients and providers which operate seamlessly over a range of different technologies (such as a VPN, Bluetooth, Wi-Fi, cellular, etc.), the client software can be greatly simplified while retaining compatibility with many different technologies.

Exemplary Proxy Agent Application Operation

FIG. 8 depicts one logical flow diagram useful to summarize the convoluted data path taken for a prior art application using a proxy agent application within the context of the traditional networking stack. As shown therein, an application 802 transmits data via a socket 804A to route data packets to a proxy agent application 814 via a TCP/IP 806/808 and a BSD network interface 810A. The data packets enter kernel space; this is a first domain crossing which incurs validation and context switching penalties.

Inside the kernel, the data is divided/copied/moved for delivery via the TCP/IP stack 806/808 to the BSD network interface 810A. The BSD network interface 810A routes the data to a virtual driver 812A. These steps may introduce buffering delays as well as improper buffer sizing issues such as buffer bloat.

In order to access the application proxy (which is in a different user space), the virtual driver reroutes the data to a second socket 804B which is in the different user space from the original application. This constitutes a second domain crossing, which incurs additional validation and context switching penalties.

In user space, the data enters an agent 814 which prepares the data for delivery (tunneling 816, framing 818, and cryptographic security 820). Thereafter, the proxy agent 814 transmits the prepared data via a socket 804B to route data packets to a user space driver 822 via the TCP/IP 806/808 and a separate BSD network interface 810B. Again, the data is passed through the socket 804B. This is a third domain crossing, with validation and context switching penalties.

Inside the kernel, the data is divided/copied/moved for delivery via the TCP/IP stack 806/808 to a BSD network interface 810B. The BSD network interface 810B routes the data to a virtual driver 812B. These steps introduce additional buffering delays as well as improper buffer sizing issues such as buffer bloat.

Finally, the virtual driver 812B reroutes the data to the user space driver (e.g., a Universal Serial Bus (USB) driver), which requires another socket transfer from 804B to 804C; the data crosses into the user space for the user based driver 822 (a fourth domain crossing), and crosses the domain a fifth time to be routed out the USB Hardware (H/W) driver 824. Each of these domain crossings is subject to the validation and context switching penalties as well as any buffering issues.

FIG. 9 depicts one logical flow diagram useful to summarize an exemplary proxy agent application within the context of the user space networking stack, in accordance with the various aspects of the present disclosure.

As shown therein, an application 902 provides data via shared memory space file descriptor objects to the agent 904. The agent 904 internally processes the data via TCP/IP 906/908 to the tunneling function 910. Thereafter, the data is framed 912, cryptographically secured 914, and routed via TCP/IP 906/908 to the user driver 916. The user driver uses a channel I/O to communicate with nexus 918 for the one (and only) domain crossing into kernel space. Thereafter, the nexus 918 provides the data to the H/W driver 920.

When compared side-by-side, the user space networking stack 900 has only one (1) domain crossing, compared to the traditional networking stack 800 which crossed domains five (5) times for the identical VPN operation. Moreover, each of the user space applications could directly pass data via function calls within user memory space between each of the intermediary applications, rather than relying on the kernel based generic mbuf divide/copy/move scheme (and its associated buffering inefficiencies).

Virtualized Hardware Enhancements—

Existing devices often source components from a variety of different vendors. Multi-sourcing allows for price competition. Different components may offer different levels of functionalities. And, each vendor of a particular component may align the level of implemented functionality to the required functionality or functionalities for the particular component, based on the functionality or functionalities implemented by the other components of an existing device. Therefore, an existing device may be made up of a variety of combinations and permutations of multiple components offering different sets of functionalities, as long as the different sets of functionalities combine to support the requisite set of functionalities for the device.

Some manufacturers differentiate themselves by offering low cost solutions. For example, a manufacturer for a particular component of an existing device may offer a minimum level of functionality at a low cost. Such a component may rely on certain functionalities to be implemented by other component(s) of the device. Because of the reliance on the other component(s) of the device for the certain functionalities, the required level of functionality for the particular component may be relatively low. Therefore, a manufacturer of a component may differentiate itself by offering the lowest-cost solution for a component by minimizing the level of functionality and associated cost while relying on certain functionalities to be implemented by other components for the device.

On the other hand, other manufacturers differentiate themselves by offering more functionality. For example, a manufacturer for a particular component of an existing device may focus on offering a maximum level of functionality. Such a component may alleviate the need for certain functionalities to be implemented by other component(s) of the existing device. Because such a component alleviates the need for certain functionalities to be implemented by other component(s) of the existing device, the device may not require as many components to serve the same set of functionalities. Therefore, a manufacturer of a component may differentiate itself by offering the highest level of functionality to lessen the level of functionality required from other components of the device.

Hardware information could be exposed to user space communication stacks. However, hardware specific information should be abstracted (or “virtualized”) away from the user space communication stacks for a variety of considerations (e.g., security, privacy, and/or efficiency). To these ends, various embodiments of the present disclosure provide techniques for virtualizing hardware behavior.

Efficient Copy-Checksum Mechanism

Existing TCP/IP formats require that protocol data is provided with an associated checksum. A checksum is a small-sized datum derived from a block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. A procedure that yields the checksum from a data input is called a checksum function or checksum algorithm. Checksums are often used to verify data integrity. Some examples of the checksum algorithms most used in practice include without limitation: longitudinal parity check, usage of modular sums, Fletcher's checksum, Adler-32, and cyclic redundancy checks (CRCs).
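
For TCP, UDP, and IPv4 headers specifically, the checksum is the ones' complement Internet checksum of RFC 1071, whose core accumulate-and-fold loop is sketched below in C; real stacks additionally fold in a pseudo-header and handle alignment, which is omitted here for brevity.

    #include <stdint.h>
    #include <stddef.h>

    uint16_t inet_checksum(const uint8_t *data, size_t len)
    {
        uint32_t sum = 0;

        while (len > 1) {                     /* sum 16-bit words */
            sum += (uint32_t)((data[0] << 8) | data[1]);
            data += 2;
            len  -= 2;
        }
        if (len == 1)                         /* odd trailing byte */
            sum += (uint32_t)(data[0] << 8);

        while (sum >> 16)                     /* fold carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }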

The checksum may be provided either in hardware or handled in software. For example, certain Wi-Fi chipsets may offer checksum hardware acceleration to improve performance, whereas other Wi-Fi chipsets may not provide checksum hardware to reduce cost. Currently, BSD stacks query the NIC in order to determine whether or not the NIC supports checksum.

As a brief aside, copying packet buffers from the user space stack is a necessity when the application is not trusted. Specifically, only the kernel's copy can be validated, sanitized, and/or modified before handing the data to the driver. The extra copy is transactional overhead. Since the extra step of copying data cannot be avoided, various embodiments of the present disclosure seek to defray the transactional overhead with simultaneous data manipulations.

In one exemplary embodiment, the flow-switch could perform a checksum while copying user space data to kernel space. The so-called “copy-checksum” is presented to the user space protocol stack as a virtualized network port that provides e.g., hardware accelerated checksum offloading. While copy-checksum is slightly more computationally expensive than a copy-only operation, the combined copy-checksum is much cheaper than existing software based checksums which perform the copy and checksum separately. Specifically, existing software based checksum logic is performed separately from the copy so as to support different NICs (which may or may not support checksum in hardware). The separate copy and checksum operations suffer from extraneous access and context switch penalties.
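
A fused copy-checksum can be sketched as follows in C, reusing the Internet checksum arithmetic shown above; the function and buffer names are hypothetical. Each byte is read once from the untrusted source, accumulated, and written to the kernel destination in the same pass, avoiding the second memory walk that a separate copy-then-checksum incurs.

    #include <stdint.h>
    #include <stddef.h>

    uint16_t copy_checksum(uint8_t *dst, const uint8_t *src, size_t len)
    {
        uint32_t sum = 0;
        size_t i = 0;

        for (; i + 1 < len; i += 2) {         /* copy + accumulate words */
            dst[i]     = src[i];
            dst[i + 1] = src[i + 1];
            sum += (uint32_t)((src[i] << 8) | src[i + 1]);
        }
        if (i < len) {                        /* odd trailing byte */
            dst[i] = src[i];
            sum += (uint32_t)(src[i] << 8);
        }
        while (sum >> 16)                     /* fold carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }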

FIG. 10 is a side-by-side comparison of two devices that are manufactured and/or ostensibly sold as the same product. As shown therein, the internal components may offer a particular functionality (e.g., checksum function) or may not. For example, device 1000 illustrates a device that includes checksum functionality and is in data communication with a modem without hardware checksum function 1014. On the other hand, device 1020 illustrates another device that does not include checksum functionality and is in data communication with a modem with hardware checksum function 1034. As long as the devices made of these components are able to offer a same cumulative level of functionality, the different combinations of constituent components may vary in their functionalities.

During operation, the software determines what hardware functionality is supported, and responsively enables or disables a particular functionality (e.g., checksum function). If the particular functionality is not supported by the hardware, the software needs to enable the functionality so that the device operates properly. On the other hand, if the particular functionality is supported by the hardware, the software may disable the functionality and free up resources for other purposes. In this way, the availability of hardware offload, as recognized by the software, may allow a particular functionality to be either enabled or disabled in the software. When the software is able to determine that a particular functionality may be disabled in the software, it may free up resources to allow more efficient use of the resources.

For example, when an existing device 1000 includes a modem without hardware checksum function 1014, the existing device 1000 enables a checksum function 1010 in a kernel space of the system. In this example, a packet of data travels from an application 1002 of user space to kernel space via socket 1004. After parsing of the packet by the TCP/IP stack 1006/1008, the packet is processed further by the checksum component 1010. After the additional parsing done by the checksum component 1010 is completed, the packet is then routed via driver software 1012 to the modem without hardware checksum function 1014.

On the other hand, when an existing device 1020 uses a modem with hardware checksum function 1034, the existing device 1020 disables the checksum function in the kernel space of the system. In this example, a packet of data travels from an application 1022 of user space to kernel space via socket 1024. After parsing of the packet by the TCP/IP stack 1026/1028, the packet simply passes through to the driver software 1032. Then, the driver software 1032 routes the packet to the modem (which performs the checksum function 1034 in hardware).

Notably, software based checksum suffers from substantial performance penalties. First, the kernel must copy the entire packet to kernel memory in order to ensure that the checksum operation cannot be tampered with (checksums in user space cannot be trusted). A software based checksum function then requires reading each word from memory for calculation and accumulation of a checksum value. Consequently, the software based checksum in kernel space pays an additional penalty of parsing the copied data to implement the checksum function. In other words, existing BSD software checksums must be performed on data that has already been copied once.

FIG. 11 is a logical block diagram of one exemplary user space networking device providing copy-checksum optimizations.

As shown therein, each application 1102 is instantiated in its own application space within the user space of the system. The disclosed instances are purely illustrative, and artisans of ordinary skill in the related arts will readily appreciate that any number of applications requiring network communication may be instantiated with various entities that are not illustrated in this drawing.

The application space includes one or more user space networking stacks 1104 that provide networking functionality. Each user space application 1102 is in data communication with its own instance of a user space networking stack 1104. In the illustrated embodiments, the user space network stack 1104 processes at least TCP/UDP/IP packaging. As described herein, the checksum function may be disabled (as if it were implemented by a different component in the system, e.g., hardware offload) and handled as part of a copy-checksum operation. Copy-checksum occurs in both directions (transmission and reception). During transmission (the illustrated embodiment), the checksum is computed while copying the packet from the user's packet pool to the driver's packet pool; this computed checksum is inserted in the packet's protocol header. In the receive direction (not shown), the checksum is computed while copying the packet from the driver's packet pool to the user's packet pool; the computed checksum is then placed in the packet metadata. The user space TCP/IP stack uses the computed checksum in the packet metadata to verify the checksum from the protocol header of the received packet, and rejects the packet if the verification fails. Artisans of ordinary skill in the related arts will appreciate that other implementations may perform the copy-checksum when moving user space data (generated by stack 1104) to the user packet pool 1106 or from the driver 1112 to the driver pool 1110.
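
The direction-specific handling described above might be sketched as follows (the packet layout and field names are hypothetical):

```c
#include <stdint.h>

/* Sketch of direction-specific placement of the computed checksum:
 * on transmit it is inserted into the protocol header; on receive it
 * is stashed in packet metadata for the user space stack to verify. */
enum csum_dir { CSUM_TX, CSUM_RX };

struct pkt {
    uint16_t hdr_csum;    /* checksum field in the protocol header */
    uint16_t meta_csum;   /* metadata checksum computed during copy */
};

static void place_checksum(struct pkt *p, enum csum_dir dir, uint16_t csum)
{
    if (dir == CSUM_TX)
        p->hdr_csum = csum;   /* inserted before handing to the driver  */
    else
        p->meta_csum = csum;  /* user space stack compares this against
                               * the checksum in the received header    */
}
```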

Notably, the virtual network port appears to provide the benefits of “hardware-based” checksum acceleration even though the copy-checksum is performed in software. Additionally, the user space application 1102 and/or user space networking stack 1104 will experience actual performance improvements from the checksum offload (the processing load has not merely been shifted to a different software process). Specifically, the TCP/IP packets from the user space networking stack 1104 are stored in a memory pool 1106, and copied from the user space packet pool 1106 to the driver packet pool 1110 in a word-wise copy-checksum operation by nexus 1108. Unlike a bulk data copy (e.g., copying packets of data from the user packet pool to the driver packet pool) that separately calculates the checksum, the word-wise copy-checksum operation allows the system to perform the checksum as part of the copy (entirely obviating the aforementioned access penalty).

More directly, the copy-checksum operation is both a copy operation and a checksum function. While this is more expensive than a traditional copy, it is less expensive than copying the entire packet as bulk data, and then retrieving each word of the packet data, updating the checksum, and storing it back. The packet that has been processed by the copy-checksum operation is then copied into another pool of memory 1110 and then passed to the hardware driver 1112 so that it can be routed to the modem without checksum function 1114.

One advantage of the foregoing is that the user space networking stack always sees a hardware accelerated checksum. This simplifies user space networking stack operation. Because the user space networking stack always sees a hardware accelerated checksum, it does not need to implement a checksum function for itself. Furthermore, the user space networking stack of this embodiment may not need to run a check to determine whether it needs to enable the checksum function. Also, the resulting data can be sent via commodity components that do not offer checksum functionality (which may be lower cost).

While the foregoing example is presented in the context of a data transmission, the various principles described herein may be used during data reception with equal success.

Methods—

FIG. 12 is a logical flow diagram of one exemplary method 1200 for virtualizing hardware functionality in user space networking stacks.

As used herein, “virtualizing” refers to software emulation of platform operation and/or hardware functionality. For example, virtualization of a hardware checksum refers to software emulation of an equivalent checksum logic. As used herein, software refers to a sequential execution of one or more instructions stored within a non-transitory computer readable medium. More sophisticated software may be multi-threaded, parallelized, branching, and/or contextually switched during sequential execution. Hardware refers to circuitry that performs logic by virtue of its physical construction; e.g., transistors, ADCs, DACs, and/or other components. While the present disclosure is illustrated within the context of virtualization of a function in its entirety, artisans of ordinary skill in the related arts will appreciate that virtualization may also occur in part. For example, certain portions of functionality may be handled in software and the remaining portions in hardware (or vice versa).

Common examples of platform operation and/or hardware functionality include without limitation: data formatting (e.g., line coding, parity, checksum), control path signaling (e.g., re-timing, timestamp modification, translation, etc.), and data modifications (e.g., scrambling, encryption, decryption, etc.). More generally, artisans of ordinary skill in the related arts will readily appreciate that a variety of different functionalities that are handled within dedicated hardware and/or firmware may be performed in-line during data transfers via a nexus, with equal success.

Virtualizing hardware functionalities enables pass-through functionalities in user space networking stacks. Pass-through functionality optimizes overall performance by avoiding unnecessary processing and/or appropriately prioritizing processing. For example, virtualization of hardware checksum offloading by the flow-switch allows user space networking stacks to skip checksum processing in user space, even when the system includes hardware that does not offer hardware checksum offloading. Additionally, rather than performing bulk copies and checksum operations in kernel space, the kernel can perform a word-wise copy-checksum to avoid access penalties. Notably, kernel space processes are also handled at a higher priority than user space processes; thus a kernel space copy-checksum prevents context switching and/or other interruptions (user space equivalents could be interrupted by higher priority kernel processes).

At step 1202 of the method 1200, a nexus reads data from a first pool of memory resources. In one exemplary embodiment, the first pool of memory is associated with a first application. In other embodiments, the first pool of memory may be associated with a driver and/or network interface. Still other embodiments may associate the first pool of memory with either a user space process or a kernel space process.

In one embodiment, the data includes packet data for transfer. Other examples of data may include without limitation: raw data for processing, encrypted data for decryption, etc. In another embodiment, the data is generated by an application. In other embodiments, the data is received, retrieved, forwarded, etc.

Common examples of data structures may include without limitation: tables, look-up-tables, arrays, two-dimensional arrays, hash tables, linked lists, records, databases, objects, etc. More generally, data structures are a collection of data values, metadata (data about the data), and/or their corresponding relationships and/or functions.

As used herein, the term “pool” refers to a collection of dedicated resources that are kept ready to use, rather than allocated for use and then released afterwards. While the present disclosure is directed to “pool” based data transfers, artisans of ordinary skill in the related arts will appreciate that other mechanisms may be substituted with equivalent success. For example, other data transfers may use e.g., first-in-first-out (FIFO) buffers, direct memory access (DMA), circular buffers, and/or any other data structure based mechanism. Still other data transfer mechanisms may incorporate logic and/or other signaling components. For example, data transfers may be based on registers, shift registers, and/or other physical components.

As a brief aside, a nexus is a logical entity that receives data streams from sources and/or generates data streams for delivery to destinations. Nexuses may perform a variety of different functions, including aggregation, division, combination, and/or other data stream processing techniques. In one exemplary embodiment, a nexus handles data flows of a specific technology. For example, a nexus (or flow switch) may handle network communications via TCP/UDP/IP. In other examples, a nexus may handle all inter-user space pipes (e.g., upipes) and/or all user-kernel space pipes (e.g., kpipes).

In one exemplary embodiment, the user space application includes a user space networking stack. Each user space application creates an instance of a user space networking stack. Each user space networking stack includes a TCP/IP networking stack, which processes data for TCP/IP headers to be transmitted with the payload. Other examples of a networking stack include without limitation: an HTTP stack, an Ethernet stack, and an IEEE 802.3u stack, which may reside in a kernel space. As used herein, a networking stack refers to a software implementation of a computer networking protocol suite or protocol family. A networking stack processes data so that it can be communicated between two entities within a network at various different layers. Examples of network protocols include without limitation: HTTP, TCP, IP, Ethernet, and IEEE 802.3u, and examples of layers include without limitation: application layer, transport layer, network layer, link layer, and physical layer.

In another embodiment, the first application is a hardware driver application receiving data for a user space application. Common examples of hardware drivers include without limitation: drivers for network interface cards (NICs), Wi-Fi NICs, Bluetooth drivers, USB drivers, and/or drivers for other wired peripherals. Each driver allows data communication between different systems of software and/or hardware. As used herein, the term “driver” refers to an interface that can be accessed by software to interact with a piece of hardware in carrying out a functionality.

At step 1204 of the method 1200, the read data is modified in accordance with a virtualized function. In one exemplary embodiment, network data may require a checksum to maintain data integrity. In one exemplary embodiment, checksums are word-based; thus, the flow-switch may read the data one word at a time from the first pool of memory resources. Each word of data is summed with a running value. In the transmit direction, the computed checksum is inserted into the protocol headers; in the receive direction, the computed checksum is attached to the metadata for the user space stack to validate against the checksum received in the protocol header. Checksum data can be used by the recipient of the data to verify that all the data was properly received (i.e., the recipient should be able to calculate a matching checksum value). The foregoing word-based checksum is purely illustrative; different formats may perform checksums on a byte basis, block basis, and/or any other data size increment.

More generally, many forms of forward error correction, error recovery, and/or error detection require running calculations on the data stream. For example, cyclic redundancy checks use a running result from a polynomial shift register to provide a check value. Parity bits are often generated by counting a number of ones (or zeros) in a data stream. Other common forms of error correction include e.g., Turbo coding, Viterbi coding, interleaving/de-interleaving, Reed-Solomon encoding, etc. Artisans of ordinary skill in the related arts, given the contents of the present disclosure, may substitute any number of error detection/correction techniques with equivalent success.

Another common example of data modification is line coding. For example, 8B10B line coding converts 8-bits of digital data to 10-bits of line coded data for transmission over physical media. The 8B10B word provides several benefits (e.g., improved error recovery and noise immunity over physical media). Other examples of line coding include e.g., 32B/33B, 64B/66B, 128B/130B, Transition Minimized Differential Signaling (TMDS), Non-Return-to-Zero (NRZ), Manchester encoding, etc.

In other such examples, audio and/or video data may intersperse certain control signaling within the data path; for example, many video formats require e.g., blanking interval insertion, clock recovery data, and/or other control information. More generally, artisans of ordinary skill in the related arts will readily appreciate that the data being read may be modified in a manner so as to assist, in whole or part, driver logic and/or hardware functionality.

While the present discussion is presented in the context of software operation, hardware accelerators may be substituted with equal success. For example, a DMA type component may enable copy-checksum transfer. Examples of DMA type components include without limitation: disk drive controllers, graphics cards, network cards, sound cards, etc.

At step 1206 of the method 1200, the data is stored in a second pool of resources by the flow-switch. The stored data can then be read from the second pool by a second application.

In one exemplary embodiment, the second application is a user space application transmitting data via a hardware driver. In another exemplary embodiment, the second application is a hardware driver application receiving data for a user space application.

In yet another exemplary embodiment, the driver application interfaces with hardware. In one variant, the hardware may be a commodity component that does not have the aforementioned functionality; it must instead rely on the virtualization of the functionality by the flow-switch, or on the functionality as implemented by and enabled on e.g., the user space networking stack. In other variants, the hardware may include the functionality, but be allowed to operate in a disabled mode. For example, a modem as described herein may be capable of the aforementioned checksum offloading. However, the hardware may still operate with the checksum offloading disabled and rely on the virtualized checksum functionality as implemented in the system as described herein. The disabled mode avoids unnecessary use of resources and thus enables a more efficient mode of operation.

Still other implementations will be made apparent to those of ordinary skill given the contents of the present disclosure.

Improvements and Changes to Memory Handling

The following discussion is directed to the salient distinctions in user space memory management that substantially improve and/or enable the foregoing discussions regarding the aforementioned exemplary channel input/output (I/O), as well as the user space networking stacks. More directly, as is discussed in greater detail hereinafter, the various aforementioned methods and techniques are substantially improved when considered in combination with novel memory architectures.

As a brief aside, traditional networking stack architectures were designed for server based applications; consequently, traditional stack architectures were expected to scale quickly to accommodate many network connections, in an environment where memory was not a primary consideration. Thus, the existing network stack implementations are not optimized for memory constrained platforms, such as consumer devices (e.g., MacBook®, iMac®, iPad®, and iPhone®, manufactured by the Assignee hereof). For example, as discussed in greater detail hereinafter, traditional stack architectures could be overly conservative by buffering too many packets and/or insufficiently aggressive in pruning idle and/or stranded memory.

For example, consider a traditional network stack that e.g., browses to a webpage and attempts to open network sockets for each of the webpage assets. Initially, the traditional network stack will open as many network sockets as it can, in parallel, under the assumption that the network connection is the primary performance bottleneck. Moreover, the aforementioned “mbufs” (memory buffers) are allocated to the maximum amount of memory space, as a conservative hedge. Empirical evidence shows that the excessive number of mbufs that are created and wasted based on the network socket instances results in overall memory pressure on the system, thereby significantly affecting overall device performance. In other words, the open sockets with an excessively conservative memory allocation, opened in parallel, consume too much of the overall system memory.

In contrast, exemplary embodiments of the present disclosure significantly improve device operation by implementing stringent memory limitations and relying on user space memory management techniques. For example, rather than opening multiple network sockets for each webpage asset, the user space network stack will only open a single channel resource and internally juggle the memory allocation for downloading the assets. More directly, each of the user space networking stacks is constrained to a fixed memory allocation for its channel. For example, each channel is limited to a few megabytes (MB) of memory. Unlike kernel space, which can scale memory to support all of its user threads, memory management techniques for each user space application are designed to frugally prune unused memory to stay within the application's own memory allocation.

Furthermore, traditional networking stack architectures suffer from buffer fragmentation that is directly attributable to the aforementioned mbuf divide/copy/move operations. As previously noted, a page of memory is divided into mbufs. Packets should ideally be stored in contiguous mbufs; however, as packets are constantly added, removed, and/or changed in size, the contiguous memory space is consumed, leaving only small holes in which to place new data. As new packets are added and/or deleted, the memory necessarily becomes fragmented into non-contiguous data blocks. Unfortunately, reclaiming a memory page requires that every mbuf of the page is deallocated.

In contrast, exemplary embodiments of the present disclosure significantly improve device operation because channel allocations are contiguously assigned for a user space application (e.g., the user space network stack), and reaped back in their entirety. Consequently, the user space memory allocation and deallocation process inherently prevents persistent fragments of memory allocation. Thus, user space networking architectures do not suffer from performance loss due to memory fragmentation.

The following discussions present additional optimizations made possible by user space memory management, which can be leveraged to further improve user space networking stack functionality.

Channel Defunct (“Reaping”)

In the following discussions, the exemplary kernel allocates memory chunks in so-called “regions”; each region is further subdivided into “segments”; each segment is divided into objects. The collection of regions for an exemplary channel I/O is called an “arena.” More directly, each exemplary channel I/O has its own arena. While the following discussion is presented within the context of a particular memory allocation scheme, artisans of ordinary skill in the related arts will readily appreciate that other schemes may be substituted with equivalent success.

Within this hierarchical memory structure, each of the constituent tiered memory allocations is traversed for access, allocation, and/or deallocation. For example, if a process requests an object, then the kernel allocates a region, segment, and object, before returning the object to the process. Similarly, in order to free a segment, all of the objects of the segment must be freed. In order to free a region, all of its segments must be freed.
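
A minimal sketch of this bottom-up freeing discipline follows (the types and counters are assumptions, not the disclosed implementation):

```c
/* Hypothetical region/segment/object bookkeeping: freeing propagates
 * upward only when every child allocation has been released. */
struct region  { unsigned segs_in_use; };
struct segment { unsigned objs_in_use; struct region *parent; };

static void free_object(struct segment *seg)
{
    if (--seg->objs_in_use == 0) {      /* last object in the segment */
        struct region *r = seg->parent;
        if (--r->segs_in_use == 0) {
            /* last segment in the region: the region may now be freed */
        }
    }
}
```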

During normal operation, there are certain circumstances where a user space application is suspended or otherwise sent into an inactive state. Within the context of the aforementioned user space memory management cleanup, a backgrounded application is usually given some amount of time in which to “gracefully” exit or resume; thereafter, the kernel will attempt to free the underlying memory allocation. This can pose significant issues for user space networking stacks, which may be operating within shared memory space and may still hold references to their allocated data objects. In particular, if the networking process was backgrounded and the kernel forcibly frees its memory objects, then the user space networking stack may not correctly recover.

Traditional networking stacks are managed directly by the kernel, and thus are not subject to these cross domain hazards and/or have kernel based solutions for recovery (e.g., the kernel closes the in-kernel stack process and notifies the application via the socket, which the user space application would then close out).

In contrast, in exemplary embodiments the kernel marks the channel allocation as “defunct”. In one such implementation, the defunct arena of the channel is mapped to an anonymous (zero-filled) memory page (or other similarly recognizable invalid content). Thereafter, the kernel frees its underlying memory structures. When the task is resumed, the user space networking application will attempt to access the same channel using the redirected memory map; the access will succeed but will yield all-zero data. The subsystem in user space will then detect that the channel is defunct, and inform the user space stack layer about it. The user space application detects the invalid content and gracefully handles the errors. For example, the user space networking stack may close out the network connection and/or terminate itself.
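
One way a user space accessor might detect the zero-filled redirection is sketched below; the guard-word convention is an assumption for illustration, not the disclosed mechanism:

```c
#include <stdint.h>

/* Hypothetical defunct detection: a guard word that is always written
 * nonzero on a live channel reads back as zero once the kernel has
 * redirected the arena to anonymous zero-filled pages. */
struct chan_hdr {
    uint32_t guard;   /* nonzero while the channel is live */
};

static int channel_is_defunct(const struct chan_hdr *hdr)
{
    return hdr->guard == 0;   /* zero => mapping was redirected */
}
```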

As used herein, the term “redirect” refers to a memory mapping that is dereferenced to a memory location. The memory location may be a valid or an invalid memory region.

In other words:

Networking memory associated with a process must be freed when the process is backgrounded. When using a shared memory interface, defuncting the memory becomes very challenging, as the process may be in the middle of logic working on the shared memory data when it is backgrounded.

One solution redirects the shared memory mappings of the task so that they are backed with anonymous (zero-filled) pages and frees the underlying memory. When the task is resumed, the user space shared memory accessor functions (Libsyscall wrappers) have the logic to detect the defuncted state of the shared memory and gracefully handle errors due to data inconsistencies.

Reaping Based on Heuristics

In-kernel network stacks juggle many different threads simultaneously for the entire system, each having different levels of priority. As a result, historically it has been infeasible to identify particular threads which are idle or underutilized in a traditional network stack. Consequently, in-kernel resource management “inline in the data path” is prone to lock ordering issues and/or holding expensive exclusive locks.

In contrast, a user space network stack only services a single thread; thus a user space networking stack can easily identify if its resources are being squandered. Moreover, even where the user space networking stack incorrectly reaps its resources, the resulting performance loss is isolated to itself; it will not affect other stacks or drivers.

In one exemplary embodiment, a user space networking stack monitors a number of parameters and/or other heuristics to determine whether or not the connection is idle. Common examples of such heuristics may include time alive, time waiting, amount of buffered data, last time active, historic use, predicted use, and/or any other predictive or probabilistic scheme to identify when to reap a process.
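
Purely by way of example, such a heuristic might combine two of the listed signals as sketched below (the threshold and field names are assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative idle heuristic: a flow is a reaping candidate if it has
 * no buffered data and has been inactive longer than a threshold. */
struct flow_stats {
    uint64_t last_active_ns;   /* timestamp of last activity */
    uint64_t buffered_bytes;   /* data currently buffered    */
};

static bool flow_is_idle(const struct flow_stats *s, uint64_t now_ns,
                         uint64_t idle_threshold_ns)
{
    return s->buffered_bytes == 0 &&
           (now_ns - s->last_active_ns) > idle_threshold_ns;
}
```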

It is appreciated that aggressive reaping methods may be used to improve performance up to a point; thereafter, overly aggressive reaping may be detrimental. More directly, from a holistic system view, each of the user space networking stacks is associated with its own unique memory pools per channel and/or per device driver, each of which has different performance requirements. For example, a cellular driver and an Ethernet driver have different requirements and/or costs of loss (e.g., Ethernet typically has higher runtime data rates, and thus a larger memory pool). Consequently, the aggressiveness or conservativeness of process reaping may be fine-tuned based on the type of application or driver and/or other application specific criteria.

In other words:

Efficient and aggressive pruning and purging of idle resources is needed. Managing resources inline in the data path may also be prone to lock ordering issues or holding expensive exclusive locks.

Various disclosed embodiments include mechanisms which can detect idle resources and can offload pruning and purging of these resources in a deferred context.

Daemon Specific Considerations

As used herein, the term “daemon” refers to a special process that runs within user space, just like other user space applications. However, daemons run in the background and do not require any user interaction at all. Moreover, 3rd party developers do not have control and cannot create system daemons. Only the Assignee hereof can create daemons for its own systems; they are a special, privileged type of process that 3rd party developers cannot deploy. Daemons are never suspended, and are usually limited to a fixed memory allocation.

Under some circumstances, a networking daemon can accidentally leak memory or cause other problems. For reasons previously articulated above, identifying rogue threads in traditional networking stacks was infeasible. However, within the context of the present disclosure, assigning networking daemons to their own “user” space networking stack (even though the daemon is not really a user process) can greatly mitigate daemon errors and improve daemon recovery, provided that the daemon's networking stack is appropriately handled.

In one exemplary embodiment, in order to ensure that a daemon is correctly operating, the kernel sets a “high water mark” for the network daemon's thread (an amount of data that a daemon cannot exceed during normal usage). Subsequently, if the daemon's thread leaks memory, the process can be terminated and/or restarted.

Unfortunately, a simple “high water mark” can still pose problems for networking daemons. In particular, networking daemon processes may run infrequently and generally operate on TCP. TCP holds onto packet buffers until the application (here, the daemon) is ready to consume the packets. Consequently, if the network daemon does not read data for long periods of time (which is relatively common for a network daemon), or if the network sends a large batch of packets (e.g., due to bad network conditions where TCP segments arrive out-of-order; TCP stores the already-received segments in its reassembly queue until the missing segments arrive), then the TCP flow could be associated with a large amount of memory. Since the TCP protocol in this case runs within the daemon, the memory associated with the TCP flow increases the physical memory footprint of the daemon. This increased physical memory footprint could exceed the daemon's allowable high water mark. Consequently, the daemon could be unintentionally targeted by the system for termination.

To these ends, various embodiments of the networking daemon stack include an efficient memory management module that keeps track of the memory consumed by the network protocols (e.g., TCP buffering) associated with the daemon. Depending on memory usage, the memory management module indicates to the kernel whether there is active work performed by the protocols on behalf of the application. Specifically, if the memory usage exceeds a certain threshold set by the memory management module, then the module indicates to the kernel that active work is being performed by the network protocols on behalf of the application. This lets the kernel know that the increased memory usage by the daemon is expected. Once the memory associated with the buffers is returned to the memory management module, the module indicates to the system that the active work is complete. This prevents the system from targeting processes that consume more memory while doing active work.
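
A minimal sketch of this threshold-based signaling follows; the tracker structure and the kernel notification hooks are hypothetical placeholders:

```c
#include <stddef.h>

/* Hypothetical memory tracker: crossing the threshold signals the
 * kernel that the protocols are doing active work on the daemon's
 * behalf; returning below it signals that the work is complete. */
struct mem_tracker {
    size_t bytes_in_use;
    size_t threshold;
    int    active;       /* nonzero while "active work" is asserted */
};

static void buffers_held(struct mem_tracker *t, size_t bytes)
{
    t->bytes_in_use += bytes;
    if (!t->active && t->bytes_in_use > t->threshold) {
        t->active = 1;
        /* notify kernel: active work started (hypothetical call) */
    }
}

static void buffers_returned(struct mem_tracker *t, size_t bytes)
{
    t->bytes_in_use -= bytes;
    if (t->active && t->bytes_in_use <= t->threshold) {
        t->active = 0;
        /* notify kernel: active work complete (hypothetical call) */
    }
}
```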

More generally, while the foregoing process is described within the context of a network daemon, artisans of ordinary skill in the related arts will readily appreciate, given the contents of the present disclosure, that substantially similar techniques could be used on other applications (e.g., slow responding or infrequent) and/or other protocols with longer queuing intervals.

In other words:

The TCP protocol maintains a sequence in packets and provides the data to the application in that order. So, TCP holds onto received packets in cases where packets are received out of order, or until the application is ready to read. If the application does not read data for long or if the network sends a large batch of packets, the TCP flow could be holding onto a large amount of memory for a while. This leads to an increased physical memory footprint for a process, and the process could be targeted by the system for termination.

In one embodiment, an efficient memory management module keeps track of the memory consumed by the network protocols and, depending on memory usage, indicates to the system that active work is being performed by the protocols on behalf of the application. Once the buffers are returned to the memory management module, the module indicates to the system that the active work is complete. This prevents the system from targeting processes that consume more memory while doing active work.

Other TCP Specific Considerations

TCP presents specific problems for traditional “defunct” channel I/O (see e.g., Channel Defunct (“Reaping”)). As a brief aside, during normal TCP operation, TCP packets are received and re-ordered. Thereafter, the TCP packet headers are checked in order to ensure that the re-ordered TCP packets are correctly received.

In other words, since the TCP headers are stored in the channel space, defuncting a channel I/O may result in data inconsistencies (and/or unknown states) in the TCP check logic (due to the memory mapping redirection to zero-filled pages); instead of triggering a graceful termination, the user space networking stack could trigger retransmission attempts or other undesirable data handling.

In one exemplary embodiment, in order to avoid data inconsistency issues when a channel is defunct during processing of a TCP packet, the user space networking stack copies the original TCP header into heap memory (which is not part of the channel allocation). Once TCP processing begins, the user space networking stack can use the copy of the TCP header to make decisions (thereby preventing undesirable behavior).
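
The shadow-copy step might be sketched as follows (a minimal sketch assuming a heap allocator is available; the names are illustrative):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical shadow copy: the TCP header is duplicated into heap
 * memory outside the channel allocation before processing begins, so
 * later decisions are immune to the channel being defuncted mid-flight. */
static void *shadow_tcp_header(const void *channel_hdr, size_t hdr_len)
{
    void *copy = malloc(hdr_len);
    if (copy != NULL)
        memcpy(copy, channel_hdr, hdr_len); /* all checks use this copy */
    return copy;
}
```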

Additionally, in some cases, various embodiments also prevent data corruption to higher layers above TCP. For example, if the data contains zeroes due to memory redirection, then the data is not forwarded on. Instead, after the copy step (from channel buffer to application buffer), user space TCP checks to see if the channel is defunct and, if so, indicates that the connection is disconnected (so that the data can be thrown away).

More generally, while the foregoing process is described within the context of TCP, artisans of ordinary skill in the related arts will readily appreciate, given the contents of the present disclosure, that substantially similar techniques could be used on any dedicated logic (which would not recognize the aforementioned invalid data).

In other words:

Using the channel memory after defunct could lead to data inconsistencies in user space TCP.

To avoid data inconsistency issues when a channel is defunct during processing of a TCP packet, various embodiments make a shadow copy of the original TCP header in heap memory. Once TCP processing begins, it uses the copy of the TCP header to make decisions, which prevents any inconsistency or data corruption. The validation is done prior to handing off the payload data to the layer above TCP, as well as within the TCP input processing paths.

Flow Manager

Unlike traditional networking stacks which only support a single protocol stack instance, the user space networking stacks each support separate instances of a protocol stack. Each stack is further associated with a set of flows. Thus, as user stack instances are created and destroyed, there is a corresponding need to manage flow life-cycles within the nexus.

In one exemplary embodiment, the nexus includes a flow manager that couples to multiple communication stacks in order to manage flow life-cycles. In one such embodiment, the flow manager is a logical entity that accepts calls to create, destroy, defunct, and/or otherwise manipulate flows. In some variants, the flow manager may also automatically shut down flows when the flow owner process exits.

As previously noted, the nexus may interact with both non-kernel and in-kernel networking stacks to e.g., enable legacy stack operation and/or other daemon based networking stacks. For example, the flow manager may interoperate with both legacy sockets as well as user space network extensions. Moreover, it is further appreciated that applications that reference higher layer APIs may not know whether they are best served by a socket, network extension, or other protocol stack instance. Selection of the appropriate protocol stack may be managed by libnetcore (e.g., either a socket to a legacy stack or a network extension to a user space networking stack, etc.).

In one exemplary embodiment, the flow manager works with the Network Extension Control Policy (NECP) module that interfaces with the user space network stack (which resides within a larger library for network interfaces (libnetcore)). For example, a higher layer API call to libnetcore will communicate with the NECP, which communicates with the flow manager to obtain a flow for e.g., a socket or user space networking stack instance.

In other words: There is a need to manage flow lifecycles (e.g., flow creation and destruction) at the flow-switch, which interfaces with calls/events from other components.

The flow manager is the entity that provides the interface. The flow manager accepts calls to create/destroy/defunct flows. It also shuts down flows when the flow owner process exits. This allows proper clean-ups to be done regardless of how the process terminates.

Flow Purging and Defuncting

As a brief aside, in traditional networking stack technologies, every connection is associated with a socket (a socket is a file descriptor). Thus, each socket is independent of other sockets. For example, if a traditional networking stack has a number of existing sockets and tries (but fails) to open another socket, the existing sockets are unaffected. Additionally, the connection is directly tied to the socket (i.e., a 1:1 mapping).

In contrast, various embodiments of the present disclosure are directed to channel I/O, which is a single file descriptor that may encompass multiple flows (a 1:N mapping). Consequently, opening and closing a channel should be performed as a separate operation from opening and closing a flow. More directly, within the context of the kernel's purging/defuncting operations, each channel instance and flow entry object is separate and distinct; they have different life-cycles and can be purged and defuncted separately.

In some cases, user space TCP/IP stacks might keep flow registration entries active without diligently closing them; this results in the kernel holding onto the flow's associated data structures, even for dead flows (like a closed TCP flow).

In one exemplary embodiment of the present disclosure, a “flow tracker” passively updates flow states (via packet inspection, which monitors but does not affect the flow state). In one such variant, the flow tracker tries to determine the state of the flow based on packet sniffing of the flow. If the flow tracker believes that the flow is dead, then the flow tracker can update the flow registration accordingly.

In one exemplary embodiment, the flow-switch actively (via a specific kernel thread for removing inactive flows) scans through all flows and finds dead flows to suspend. In one such variant, the flow-switch can read entries updated by the flow tracker. In other variants, the flow-switch may actively test one or more flows for suspension. In some variants, a daemon (assertiond) runs in the background to defunct flows when a process is suspended.

In other words:

User space TCP/IP stacks might be holding flow registrations without diligently closing them, thus causing the kernel to keep data structures around, even for dead flows (like closed TCP flows).

In one embodiment: (1) the flow tracker passively updates flow state; (2) the flow-switch actively scans through all flows, finds the dead flows, and closes them; and (3) orthogonally, during defunct, assertiond calls in to defunct flows when the process goes suspended.

Flow Control and Advisory

As previously noted, the unique shared memory of the channel I/O and flows (when compared to prior art 1:1 socket based solutions) requires different methods for bandwidth sharing. Consequently, in one exemplary embodiment of the present disclosure, the flow manager includes a mechanism to moderate flow control for each flow so as to efficiently use the overall channel for the user space networking stack.

In one such implementation, the user space network stack infrastructure provides a packet I/O mechanism to the user space stack. Additionally, the packet I/O mechanism includes Active Queue Management (AQM) functionality for the flows associated with the user space network stack. AQM culls packets to ensure that each flow does not approach its maximum size (i.e., to prevent a single flow from dominating the shared network interface bandwidth). Moreover, since packet culling may require removing “good” packets, the AQM module trades off overall channel performance against individual flow performance. In some cases, a flow may be decimated for the benefit of the channel, or conversely the overall channel efficiency may be reduced to benefit a flow.

As a related aspect, the unique shared memory of the channel I/O and flows also requires different schemes for efficiently providing flow advisory feedback to user space stacks. More directly, the various advisory information for the flows of the channel is separate from the channel's overall performance.

As noted above, the user space network stack infrastructure may support a packet I/O mechanism that includes Active Queue Management (AQM) functionality for the flows associated with the user space network stack. In one such variant, the AQM functionality in the user space network stack utilizes a kernel event mechanism with a specific type to perform flow advisory reporting (e.g., that a flow has started, stopped, etc.).

As a brief aside, a flow advisory on a connection is received from AQM when one of the following two conditions is true: 1. the send rate of a TCP connection increases above the bandwidth supported on the link, or 2. the available bandwidth on a wireless link (the first hop from the device) goes down.

Flow advisory conditions present problems because sending more packets will accumulate packets in the interface queue and will increase the latency experienced by the application. Otherwise, they might cause packet drops, which will reduce performance because the TCP sender will have to retransmit those packets. By using the flow-advisory mechanism, the TCP senders can adapt to the available bandwidth without seeing any packet loss or any loss of performance. The interface queue will never drop a TCP packet; it will only send a flow advisory to the connection. Because of this mechanism, buffering in device drivers was reduced by a significant amount, resulting in improved latency for all TCP connections on the device.

In other words:

In a user space TCP/IP stack architecture, the stack instance and the network driver operate in different domains (user space and kernel space). An efficient mechanism is needed for the user space stack to determine the admissibility state of a given flow in the stack instance.

In one embodiment, user space network stack infrastructure channels provide a flow advisory table in shared memory which is updated by the kernel and consulted by the user space stack to flow control a given flow. In essence, this table provides admission control information to the user space stack.

In a user space TCP/IP stack architecture, the stack instance and the network driver operate in different domains (user space and kernel space). An efficient mechanism is needed to signal the user space stack from kernel space to “flow control” or “resume” a given flow in the stack instance.

In one embodiment, user space network stack infrastructure channels utilize a kernel event mechanism with a specific type to indicate to the user space stack any updates to the flow advisory state in the kernel, which is reflected in the “flow advisory table” maintained in shared memory. Each row in the table represents information about the flow, as well as the advisory state (e.g., flow-controlled, etc.).
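
One possible shape for such a table row is sketched below; the layout and state values are assumptions for illustration only:

```c
#include <stdint.h>

/* Hypothetical row of the shared-memory flow advisory table: the
 * kernel writes the state; the user space stack consults it (after a
 * kernel event of the specific type) before generating more packets. */
enum flow_adv_state {
    FLOW_ADV_OK = 0,
    FLOW_ADV_FLOW_CONTROLLED,
};

struct flow_adv_row {
    uint32_t flow_id;            /* identifies the flow            */
    volatile uint32_t state;     /* enum flow_adv_state, kernel-set */
};

static int flow_is_admissible(const struct flow_adv_row *row)
{
    return row->state == FLOW_ADV_OK;
}
```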

User Space Networking Stack AQM Optimizations

Existing implementations of AQM enable the network to provide AQM flow control and advisory information to the in-kernel stack. However, under the traditional networking paradigm, the in-kernel stack is unaware of the applications associated with the data. In contrast, AQM flow control and advisory information may be further tailored to, and improve, user space networking stacks because the user space network stack can quickly identify which flows should be culled (or preserved) on the basis of application considerations.

In one exemplary embodiment, the user space networking stack can implement AQM to intelligently prevent buffer bloat conditions based on which flows should be preserved and/or which flows can be culled. More directly, by intelligently selecting AQM culling based on application considerations, the user space networking stack can achieve the benefits of both AQM as well as flow priorities.

Moreover, AQM in the uplink direction can also be improved. In some embodiments, the user space network stack can further tailor flow control and advisory before transmission by checking if the flow is admissible on the channel prior to the transport layer generating packets.

In one embodiment, legacy AQM functionality is preserved for both user space networking stacks and in-kernel stacks. In one such implementation, the in-kernel stack can get synchronous flow advisory feedback in the context of the send/write operation.

In other words:

A common AQM (Active Queue Management) functionality for a network interface hosting multiple and differing stack instances (user space protocol stack and in-kernel protocol stack) is desired. The in-kernel BSD stack uses mbuf packets whereas the user space stack instance uses user space network stack infrastructure packets. The flow control and advisory feedback mechanisms also differ for these stacks due to their placement.

In one variant, the user space network stack infrastructure flow-switch nexus is a common entry point for the in-kernel BSD stack and the user space stack. The flow-switch nexus handles the different packet descriptor schemes and converts them to the packet descriptor scheme being used by the underlying network driver before storing the packets to the AQM queues. It also implements the appropriate mechanisms to provide flow control and advisory feedback from the AQM queues to the different stack instances.

Driver Managed Pool Resources

As used herein, “wired” memory refers to memory allocations that are backed by actual physical memory; in contrast, “purgeable” memory refers to memory allocations that may be either actually present or virtually present (virtually present memory can be recalled from a larger backing memory, with a cache lookup penalty). Notably, the aforementioned mbufs for traditional in-kernel operation are wired memory; however, the memory allocations for channel I/O disclosed in the various described embodiments are generally purgeable.

In some cases, a device driver may require a pool of packet buffers to support direct memory access (DMA). In one exemplary embodiment, in order to support DMA within the shared purgeable memory, the driver dynamically maps the pool into the Input/Output Memory Management Unit (IOMMU) or DMA Address Relocation Table (DART) aperture. In some variants, the driver managed pool of resources can be controlled by the driver (e.g., not by the user or kernel process). Various embodiments may further allow the pool to be exclusive to the driver, or shared among several drivers. Read and write attributes can also be restricted on both the host and the device side based on the I/O direction.

In other words:

A system global packet buffer pool is suboptimal in terms of resource allocation, and does not offer the ability to deploy device/driver specific security policies.

In one exemplary embodiment, a packet buffer pool managed and owned by a driver, which can be dedicated to that driver or shared among several drivers, is disclosed. The owner of the pool handles notifications to dynamically map and unmap the pool's memory from its device IOMMU aperture. This same notification can also wire/un-wire the memory as needed. Read and write attributes can also be restricted on both the host and the device side based on the I/O transfer direction for added security.

Segment-Based IOMMU/DART Mapping Considerations

As previously noted, a channel is associated with an arena, which is composed of regions, which are further composed of segments. One of the aforementioned benefits of the purgeable memory hierarchy, within the context of the user space networking stack, is that portions of the memory can be dynamically freed and/or allocated throughout the lifetime of the stack.

However, certain device drivers require that their memory is wired on demand; e.g., the system memory shared with the hardware device may need to be wired during an I/O operation. Consequently, various embodiments of the present disclosure can wire down and I/O map a memory segment within the shared channel space. Since looking up the I/O bus address through an Input/Output Memory Management Unit (IOMMU) or DMA Address Relocation Table (DART) is not cheap, mapping a memory segment also has the benefit of being able to quickly derive the I/O bus address for all the packet buffers within that memory segment based on the single memory segment lookup.

In one such implementation, the request to map the memory segment using the memory management unit (IOMMU, DART, etc.), triggered by a constructor/destructor call back from the memory segment, can be extended and overridden by each driver via object-oriented subclassing to implement driver specific behavior.

In other words:

A packet buffer is typically smaller than a page size, but the IOMMU requires mappings that are multiples of a page size. Looking up an I/O bus address can also be expensive.

Within the packet pool, a memory segment which is guaranteed to be a multiple of a page size is used as the smallest memory unit for I/O mappings. Each memory segment is then divided into several packet buffers. Only one I/O bus address lookup is required for all the packet buffers within that segment, and this I/O bus address can also be cached within the segment object.
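
The single-lookup derivation can be sketched as follows (the field names are hypothetical):

```c
#include <stdint.h>

/* Hypothetical segment descriptor: the IOMMU/DART translation of the
 * segment base is looked up (and cached) once; every packet buffer in
 * the segment is then a fixed offset from that base. */
struct io_segment {
    uintptr_t cpu_base;   /* segment base address (CPU view)    */
    uint64_t  bus_base;   /* cached I/O bus address of cpu_base */
};

static uint64_t packet_bus_addr(const struct io_segment *seg,
                                uintptr_t pkt_cpu_addr)
{
    return seg->bus_base + (uint64_t)(pkt_cpu_addr - seg->cpu_base);
}
```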

User Packet Pool

Traditional in-kernel networking stacks are unaware of user space application requirements, and thus always allocate as much memory as possible (to accommodate the worst case scenario). However, as previously noted, user space networking stacks can be far more aggressive, and dynamically allocate memory (and/or free memory) on an as needed basis.

Unfortunately, dynamically allocated memory can introduce other potential issues. For example, whenever the user application asks for a new memory allocation via a system call, the kernel returns the memory allocation. This cross domain transition can result in the aforementioned performance issues (due to context switching, etc.). More directly, dynamically allocated memory schemes introduce cross domain system calls that must be moderated to enable each process to achieve the maximum possible throughput.

An efficient scheme should dynamically scale up and down the memory available to each process according to the current throughput requirements. In one such exemplary embodiment, the user packet pool uses an efficient packet I/O mechanism to move memory buffers across the kernel-user boundary and utilizes channel sync statistics to dynamically scale the amount of memory available to each channel.

In one such implementation, dynamic memory allocations are performed via a set of rings. A first “alloc” ring is used to store packet requests; a second “free” ring is used to store memory allocations that should be freed. During operation, the user space stack requests to allocate some packets via a single call, and the kernel in turn does the actual object allocations and attaches the objects to the alloc ring. Then, when the user space stack is done with the objects, it attaches them to the free ring, and notifies the kernel via a single call. In turn, the kernel will free each object that is returned via the free ring.

By using a ring based transfer mechanism, the user and kernel processes can “batch” multiple allocation and deallocation operations together (e.g., at packet transfers); more directly, this allows the kernel to amortize the system call cost over time (e.g., over multiple packet calls).
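
A minimal single-producer ring sketch illustrating the batching idea follows (the sizes, types, and the sync call are assumptions):

```c
#include <stddef.h>

/* Hypothetical fixed-size ring: user space batches finished objects
 * onto the "free" ring, then makes one sync call; the kernel drains
 * the ring and frees each object, amortizing the syscall cost. */
#define RING_SLOTS 128

struct ring {
    void    *slot[RING_SLOTS];
    unsigned head, tail;   /* producer writes head, consumer writes tail */
};

static int ring_push(struct ring *r, void *obj)
{
    unsigned next = (r->head + 1) % RING_SLOTS;
    if (next == r->tail)
        return -1;         /* ring full: caller must sync with the kernel */
    r->slot[r->head] = obj;
    r->head = next;
    return 0;
}
```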

In other words:

When network stack processing is moved into a process context, it is a challenge to allocate enough memory efficiently to enable each process to be able to achieve the maximum possible throughput.

In one embodiment, an efficient scheme to dynamically scale up and down the memory available to each process according to the current throughput requirements is disclosed. The user packet pool is a mechanism to achieve this; it reuses the efficient packet I/O mechanism to move memory buffers across the kernel-user boundary and utilizes channel sync stats to dynamically scale the amount of memory available to each channel.

Dynamic Sizing for Flow-Switch Ports

As previously noted, the unique shared memory of the channel I/O and flows (when compared to prior art 1:1 socket based solutions) can be further leveraged to maximize flow-switch operation. As a brief aside, even though the channel I/O persists, each of the flow-switch ports can dynamically be created or removed. Thus, the flow-switch port space can experience “fragmentation” effects over time. Even though these fragmentation effects disappear once the channel itself is closed, excessive fragmenting can impact the memory usage and lookup efficiency for the duration of the channel lifetime.

In one exemplary embodiment, the flow-switch breaks up the port space into chunks and can manage flow-switch ports at chunk granularity; in this manner, flow ports and their data structures can be grown and shrunk on demand. For example, consider a flow-switch with 100 ports allocated in 10 port chunks. If 95 ports are freed and only ports 91-100 remain in use, then only the chunk containing ports 91-100 needs to be reserved (the rest can be freed).
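
A sketch of chunk-granular bookkeeping follows (the chunk size of 10 mirrors the example above; all names are hypothetical):

```c
/* Hypothetical chunked port accounting: a chunk's backing data
 * structures can be shrunk as soon as its in-use count reaches zero. */
#define PORTS_PER_CHUNK 10

struct port_chunk {
    unsigned in_use;   /* number of active ports in this chunk */
};

static void release_port(struct port_chunk *chunks, unsigned port)
{
    struct port_chunk *c = &chunks[port / PORTS_PER_CHUNK];
    if (--c->in_use == 0) {
        /* entire chunk idle: its data structures can be freed on demand */
    }
}
```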

In other words:

Flow-switch ports come and go and could become fragmented over time (leaving large holes with high port numbers open); without proper management, this could use extra memory for bookkeeping.

In one embodiment, the flow-switch breaks up the port space into small, contiguous chunks and manages ports at that granularity; data structures are grown and shrunk on demand. This allows sparse port usage to be supported.

Further Enhancements and Optimizations

Metadata Red Zones

Contiguous or adjacent memory objects are prone to inadvertent memory corruption due to buffer overrun issues. In a shared memory architecture, the ability for the consumer of the data to detect these issues in the least expensive manner is important.

Various embodiments of the present disclosure include user space network stack infrastructure data descriptors (also known as packets or quanta) that have a metadata preamble placed at the beginning of the object. This metadata preamble is used to detect any inadvertent overwrite of the metadata. Each metadata object will have a unique red zone pattern which, in one such variant, is the XOR of a red zone cookie and the offset of the metadata object in the object's memory region. Red zone cookies are initialized with random numbers on OS boot. In the event the kernel detects a corruption, the user space process associated with the channel is terminated to prevent further damage.
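
The red zone computation described above might be sketched as follows (the widths and names are assumptions):

```c
#include <stdint.h>

/* Red zone per the scheme above: the expected pattern is the XOR of a
 * boot-time random cookie with the object's offset in its memory
 * region, so no two metadata objects share the same pattern. */
static uint64_t redzone_cookie;   /* seeded with a random value at boot */

static uint64_t redzone_pattern(uint64_t obj_offset)
{
    return redzone_cookie ^ obj_offset;
}

static int redzone_intact(uint64_t stored, uint64_t obj_offset)
{
    return stored == redzone_pattern(obj_offset); /* else: overrun */
}
```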

Object Validations and Sanitations

Using packet descriptor memory shared between user space and the kernel is prone to many security vulnerabilities. An efficient method to validate and sanitize these objects during the user to kernel handoff is needed.

In one such embodiment, a user space network stack infrastructure architecture maintains a mirrored copy of the packet descriptor memory which is accessible only from the kernel. During packet handoff from user space to kernel, the user accessible descriptor is validated (against the kernel copy) for any semantic issues, and the sanitized data is copied to the kernel mapped descriptor.

Nexus Port Binding to PID, Key Blob, Binary UUID

Access control on user space network stack infrastructure nexus ports is necessary to prevent unauthorized clients from opening channels.

In one such embodiment, an access control mechanism based on one or more attributes associated with a channel client, namely process ID, process executable's UUID, or key blob, is used. A nexus provider chooses to select one or a combination of those attributes for securing access to a port of a named nexus instance.

SYN Flood Detection and Mitigation; RST Flood Detection and Mitigation

Since a TCP stack in user space can be compromised by malicious actors running in the same address space, mechanisms to prevent SYN flood and RST flood attacks originating from the user space stack are needed. A flood attack is a form of denial-of-service attack in which an attacker sends a succession of (SYN/RST) requests to a target's system in an attempt to consume enough server resources to make the system unresponsive to legitimate traffic.

In one embodiment, the user space network stack infrastructure flow-switch implements flow tracking logic which can detect SYN floods and RST floods to prevent these attacks originating from the device. If an attack is detected, the flow-switch will rate-limit the SYN and RST packets coming from the user space stack.

Split TX and RX Packet Pools (Direction Specific DMA Access for Security)

Buggy or hostile devices may use PCIe-mapped buffers to attack the host, such as by overwriting the content of in-use buffers, or performing timing/time-of-use based attacks.

To protect against and mitigate these attack surfaces, embodiments of the user space network stack infrastructure set up map segments to use the minimum possible memory access permissions on receive and transmit packet buffers (e.g., the device needs only write access to receive buffers and only read access to transmit buffers).

Randomized Memory Segment Sizes

Buggy or hostile devices may use PCIe-mapped buffers to attack the host.

To help mitigate this vulnerability, embodiments of the present disclosure randomize the PCIe address space mappings to make it difficult for an attacker to find vulnerable host-side resources. To support this protection, variants of the user space network stack infrastructure randomize segment size by randomizing the number of pages per segment at the time segments are allocated. The user space network stack infrastructure may also randomize packet order within a segment, to make it more difficult to correlate a packet's address with its position within a segment. This can be done via a random slide when the segment is first split into packets: for instance, by randomly choosing which slice of the segment is the first packet, instead of always using index 0. Together, these protections make it difficult for an attacker to predict a segment's start, end, and position from a packet's address. This also makes it difficult for an attacker to predict the location of other segments.
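As a rough illustration (the names, page and buffer sizes, and the entropy source are placeholders), segment sizes can be drawn from a random page count, and the segment can be split into packet buffers starting at a random slide:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>

    #define PAGE_SIZE   4096u
    #define PKT_BUF_SZ  2048u

    /* stand-in for a kernel entropy source */
    static uint32_t rand32(void) { return (uint32_t)random(); }

    /* pick a random page count in [min_pages, max_pages] per allocation */
    size_t
    segment_size_randomized(uint32_t min_pages, uint32_t max_pages)
    {
        uint32_t span = max_pages - min_pages + 1;
        return (size_t)(min_pages + rand32() % span) * PAGE_SIZE;
    }

    /*
     * Split a segment into packet buffers starting from a random slide,
     * so buffer index 0 does not always sit at the segment base.
     */
    void
    segment_split(uint8_t *seg, size_t seg_len, uint8_t **bufs, size_t nbufs)
    {
        size_t slots = seg_len / PKT_BUF_SZ;
        size_t slide = rand32() % slots;      /* random first slice */

        for (size_t i = 0; i < nbufs && i < slots; i++)
            bufs[i] = seg + ((slide + i) % slots) * PKT_BUF_SZ;
    }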

TOCTOU Attack Mitigation

Networking devices such as Wi-Fi chips and baseband processors could be compromised. Such compromised firmware could launch attacks against the kernel on the application processor using DMA memory. Time of Check to Time of Use (TOCTOU) attacks are caused by changes in a system between the checking of a condition (such as a security credential) and the use of the results of that check. For example, a TOCTOU attack could change DMA'ed memory after the kernel has done its sanity check.

In one exemplary embodiment, the nexus makes a kernel-only copy before accessing device-supplied data to help mitigate this vulnerability; all subsequent sanity checks on, and uses of, the data are carried out on the kernel-only copy. Even if a compromised device launches a TOCTOU attack, the kernel sees and uses the consistent kernel-only copy, which is unaffected.
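A sketch of the copy-then-check pattern (the descriptor layout and bound are hypothetical): the DMA-visible descriptor is snapshotted once into kernel-only memory, and every check and use thereafter touches only the snapshot, so a later rewrite by compromised firmware has no effect.

    #include <stdint.h>
    #include <string.h>
    #include <errno.h>

    struct rx_desc {                       /* hypothetical device descriptor */
        uint32_t len;
        uint32_t flags;
    };

    #define RX_LEN_MAX 9018u               /* e.g., a jumbo frame upper bound */

    /*
     * dma_d points into DMA-visible memory the device can rewrite at any
     * time.  Copy it once into a kernel-only buffer, then validate and use
     * only the copy.
     */
    int
    rx_desc_internalize(const volatile struct rx_desc *dma_d,
                        struct rx_desc *kcopy)
    {
        memcpy(kcopy, (const void *)dma_d, sizeof(*kcopy)); /* one snapshot */

        if (kcopy->len > RX_LEN_MAX)       /* sanity checks on the copy only */
            return EINVAL;
        return 0;                          /* all later uses read *kcopy */
    }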

Entitlements to Access Stats and Nexus Operations

Unauthorized applications could infer user and/or application activities from unprotected user space network stack infrastructure stats, or perform unauthorized nexus operations (like opening a channel to a NIC).

In one such embodiment, the system makes entitlement checks for privileged operations, so that these operations can be performed only by processes possessing such entitlements, e.g., trusted processes.

Leveraging RTT Estimation Data for Bounds Checking

Round Trip Delay Time (RTT) measurement is a critical value for TCP operations such as retransmission and fast recovery. Since TCP stacks are in user space, they can push down a malicious RTT measurement for a particular route (e.g., an extremely small value). The effect of this is that, based on this measurement, other TCP stacks could unnecessarily generate many retransmissions to that host. With many devices compromised in this way, it could effectively launch a DoS (denial of service) attack against that host.

In one embodiment, the kernel also does its own rough RTT measurement using the flow tracker in the flow-switch to mitigate RTT-type attacks. To accept measurements from user space, the kernel does a sanity check against its estimated upper and lower bounds. Only the RTT samples that pass the kernel sanity check may be published to other TCP stack instances.
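Illustratively (the tolerance band below is an assumption, not specified by the disclosure), the acceptance test can be a simple bounds check around the kernel's own estimate:

    #include <stdint.h>
    #include <stdbool.h>

    /*
     * kernel_rtt_est_us: the flow tracker's own rough RTT estimate for the
     * route, in microseconds.  A user-supplied sample is published to other
     * stack instances only if it falls within a band around that estimate;
     * the band width here is illustrative.
     */
    bool
    rtt_sample_acceptable(uint64_t user_rtt_us, uint64_t kernel_rtt_est_us)
    {
        uint64_t lo = kernel_rtt_est_us / 2;      /* lower bound */
        uint64_t hi = kernel_rtt_est_us * 2;      /* upper bound */

        return user_rtt_us >= lo && user_rtt_us <= hi;
    }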

Do Malicious Stats Detection Before Folding into Trusted Stats. Always Use Trusted Stats from Kernel for Critical Info (e.g. Cellular Usage)

Since TCP/IP stacks are in user space, they can push down malicious stats counters to fool the kernel or other instances of the TCP/IP stack. For instance, the user TCP could claim that it is using fewer packets/bytes when using cellular data, to get around the user-visible cellular usage accounting.

In one embodiment, the nexus also instantiates a shadow kernel-only stats object in addition to the user space protocol stack instance's shared stats object. The kernel-only stats object stores historical values of the user space protocol stack stats. Before accepting the user space protocol stack stats, the nexus derives the delta of each uTCP stats snapshot against the historical value and does anomaly detection. Also, for critical stats, such as cellular data usage, the user space network stack infrastructure relies only on trusted flow-switch kernel-observed stats.
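A sketch of the fold-in check (counter names and delta limits are hypothetical): a snapshot is accepted only if every counter is monotonic with respect to the kernel-only history and the per-interval delta is plausible.

    #include <stdint.h>
    #include <stdbool.h>

    struct tcp_stats {                    /* hypothetical per-stack counters */
        uint64_t bytes_out;
        uint64_t pkts_out;
    };

    /*
     * hist holds the last accepted snapshot (the kernel-only shadow copy).
     */
    bool
    stats_fold(struct tcp_stats *hist, const struct tcp_stats *snap,
               uint64_t max_bytes_delta, uint64_t max_pkts_delta)
    {
        if (snap->bytes_out < hist->bytes_out ||   /* counters ran backwards */
            snap->pkts_out < hist->pkts_out)
            return false;

        if (snap->bytes_out - hist->bytes_out > max_bytes_delta ||
            snap->pkts_out - hist->pkts_out > max_pkts_delta)
            return false;                          /* implausible jump */

        *hist = *snap;                             /* accept and remember */
        return true;
    }

Cellular usage accounting would still come only from the flow-switch's own kernel-observed counters, per the text above.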

Preventing IP Address/Port Spoofing

Since TCP/IP stacks are in user space, they can generate packets that are not allowed from that particular TCP/IP instance, e.g. an IP address that doesn't belong to that host, or ports that don't belong to that process.

In one embodiment, the flow-switch does a flow 5-tuple lookup in the kernel against the registered flows before packets are transmitted, to make sure the sender has a matching 5-tuple registration. Any packets with a non-matching 5-tuple, or with other mismatched metadata such as the flow identification, are dropped.
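Conceptually (struct and function names are hypothetical, and IPv4 is used for brevity), the transmit-path check compares the packet's parsed 5-tuple against the flow entry registered by the sending nexus port:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct flow_tuple {                   /* hypothetical 5-tuple key */
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint8_t  proto;
    };

    struct flow_entry {                   /* registered at flow setup time */
        struct flow_tuple tuple;
        uint32_t nx_port;                 /* owning flow-switch port */
    };

    static bool
    tuple_eq(const struct flow_tuple *a, const struct flow_tuple *b)
    {
        return a->saddr == b->saddr && a->daddr == b->daddr &&
               a->sport == b->sport && a->dport == b->dport &&
               a->proto == b->proto;
    }

    /*
     * TX-path check: the packet's parsed 5-tuple must match a flow that
     * was registered by the same nexus port, otherwise the packet drops.
     */
    bool
    tx_pkt_permitted(const struct flow_tuple *pkt, uint32_t src_nx_port,
                     const struct flow_entry *fe)
    {
        return fe != NULL && fe->nx_port == src_nx_port &&
               tuple_eq(pkt, &fe->tuple);
    }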

Trusted TFO & ECN

TCP in the user space network protocol stack supports both TCP Fast Open (TFO) and Explicit Congestion Notification (ECN). Both TCP options are enabled and/or disabled based on per-network heuristics maintained on the system. This is done to avoid using TFO and ECN on networks that either do not support these options or blacklist devices if the options are present in the TCP header.

During normal operation, the ECN and TFO heuristics are updated each time a TCP connection experiences a success or failure when using TFO or ECN. If a TCP connection does not experience issues when using these options, new TCP flows continue to use them. Consequently, if the heuristics are updated with incorrect data, TFO and ECN could end up enabled on networks that do not support these options.

Within the context of a user space network stack infrastructure, the TCP protocol stack runs in the user process's context. Each time a user space TCP connection experiences success or failure while using TFO or ECN, it makes a system call into the kernel to update the heuristics. So, a malicious app could indicate a TFO or ECN success on networks that do not support TFO or ECN by simply making a system call. This would result in new flows on the system incorrectly using the TFO and ECN options, which could lead to a bad user experience or, in worst case scenarios, blacklisting of devices.

All processes can indicate a failure of TFO or ECN to the system heuristics. But, in one exemplary embodiment, only processes that are trusted on the system can update the heuristics with a TFO or ECN success. This prevents malicious apps from incorrectly recording TFO or ECN success on networks that do not support these options.

Multi-Buflet Descriptors (Array)

Existing systems allocate memory to hold the largest possible frame size, but jumbo frames need to be supported in a memory efficient manner.

In one embodiment, a packet can hold an array of buflets, where each buflet points to a fixed-size block of memory allocated from a pool. The binding between the buflets and a packet can be formed on demand. This scheme allows a packet to have a variable number of buflets depending on the size of the payload. This also makes it easier to support scatter-gather style DMA engines by handing them buflets, which are uniform by nature.
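A minimal sketch of such a descriptor (the sizes and names are illustrative): the buflet count is bound on demand to the payload size, and the total packet length is the sum over the array.

    #include <stdint.h>
    #include <stddef.h>

    #define BUFLET_SIZE 2048u             /* fixed block size from the pool */
    #define MAX_BUFLETS 8u                /* enough for a jumbo frame */

    struct buflet {
        uint8_t *data;                    /* fixed-size block from the pool */
        uint16_t len;                     /* bytes used in this block */
    };

    struct packet {
        uint16_t nbufs;                   /* bound on demand to payload size */
        struct buflet bufs[MAX_BUFLETS];
    };

    /* total payload length is the sum across the buflet array */
    size_t
    packet_length(const struct packet *p)
    {
        size_t len = 0;
        for (uint16_t i = 0; i < p->nbufs; i++)
            len += p->bufs[i].len;
        return len;
    }

A 9000-byte jumbo frame would then occupy five 2 KB buflets rather than a dedicated jumbo-sized buffer, and each buflet maps naturally onto one scatter-gather DMA element.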

Split Metadata and Buffer Management

Exposing packet metadata to hardware such as Wi-Fi chips and the cellular baseband could lead to security vulnerabilities such as Return Oriented Programming and TOCTOU attacks.

In one exemplary embodiment, the system uses different memory regions for the packet metadata and the packet buffers to prevent malicious hardware from accessing the packet metadata. Only the packet buffers are I/O mapped and visible to the device.

User Pipe Dynamic Memory Management Using Sync Stats

Various embodiments of the user pipe nexus provide efficient IPC between user space processes using shared memory. However, the number of processes using IPC on an iOS device can be significant. An efficient mechanism is needed to keep the shared memory usage to a minimum without compromising data throughput.

In one embodiment, the system maintains a fair estimate of the user's immediate memory usage (working set) based on recent past usage. The user pipe nexus maintains a weighted moving average of memory used during each sync and keeps adjusting the channel memory accordingly.
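One plausible form of that estimator (the 1/8 smoothing weight and the 1.5x headroom are assumptions) is an exponentially weighted moving average updated on every channel sync:

    #include <stdint.h>

    /*
     * ewma_bytes: weighted moving average of bytes moved per sync; the
     * weight (1/8 new, 7/8 history) mirrors classic SRTT smoothing.
     */
    struct upipe_mem_stats {
        uint64_t ewma_bytes;
    };

    void
    upipe_sync_update(struct upipe_mem_stats *s, uint64_t bytes_this_sync)
    {
        if (s->ewma_bytes == 0) {
            s->ewma_bytes = bytes_this_sync;
        } else {
            s->ewma_bytes -= s->ewma_bytes / 8;
            s->ewma_bytes += bytes_this_sync / 8;
        }
    }

    /* target channel memory: working-set estimate plus headroom */
    uint64_t
    upipe_target_mem(const struct upipe_mem_stats *s)
    {
        return s->ewma_bytes + s->ewma_bytes / 2;   /* 1.5x headroom */
    }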

Purgeable Memory (Compressible and Swappable)

The networking memory requirement on an iOS device can be significant. Existing architectures need all of the memory to be wired, which reduces the system's ability to recover under memory pressure, as the memory cannot be swapped or compressed.

In one exemplary embodiment, the user space network stack infrastructure architecture allocates all memory as purgeable and wires memory on demand when needed.

Memory Region/Arena: Purpose, Layout, Access Protection, Sharing Model

An efficient and generic mechanism is needed to represent and manage shared memory objects of varying types and sizes which are memory mapped to user space and/or kernel space.

In one embodiment, the user space network stack infrastructure architecture uses shared memory for efficient packet I/O, network statistics, and system attributes (sysctl). The user space network stack infrastructure arena is a generic and efficient mechanism to represent these various types of shared memory subsystems and their backing memory caches, regions, and access protection attributes. A channel schema is a representation of the shared memory layout that lets a user space process efficiently access the various channel objects.

Mirrored Memory Regions

To implement security validation and sanitization of shared memory objects at the user-kernel boundary, the kernel checks a kernel-only copy of these objects. Improved methods for allocating and retrieving these objects are needed.

In one embodiment, the system creates mirrored memory objects which share the same region offset as that of the associated object and hence can be retrieved quickly from the attributes of the associated object.

Flow Classification

In one embodiment, the flow-switch parses the various protocol layers in a classifier approach, e.g. IP/TCP/UDP all at once. Without sharing the result, the protocol stacks that later consume the packet would each duplicate the parsing of their headers.

In one variant, user space network stack infrastructure packets carry a struct_flow as part of the packet metadata, which contains most of the information that those layers need; it is carried into BSD/user space, etc. The contents of this structure are filled in once by the flow-switch, which is important for performance (the cost of parsing the protocol headers is paid only once).

Flow Entries

A mechanism to facilitate efficient packet forwarding within a user space network stack infrastructure flow-switch is needed.

In one embodiment, packet forwarding based on the entries of a flow table allows the system to implement optimal forwarding data plane logic; in one such variant, multiple network interface nexuses are fused together to form a direct conduit for sending packets to one another.

Flow Actions

A flow-switch flow needs to carry the action to be taken on packets of that flow.

In one embodiment, the flow defines the possible actions that can be applied to its packets, e.g. forward to a flow-switch port for a user space protocol stack, forward to the BSD stack, drop, transform, etc. This allows for an efficient way to apply traffic rules without involving separate database lookups.
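For illustration (the action set and handler names are hypothetical, with the externs standing in for the forwarding paths), the action can be an enumeration stored on the flow entry and applied directly after flow lookup, with no second database query:

    #include <stdint.h>

    /* hypothetical action set carried on each flow entry */
    enum flow_action {
        FLOW_ACT_FORWARD_CHANNEL,   /* to a flow-switch port (user stack) */
        FLOW_ACT_FORWARD_BSD,       /* to the in-kernel BSD stack */
        FLOW_ACT_DROP,
        FLOW_ACT_TRANSFORM,         /* e.g., encapsulate / rewrite */
    };

    struct flow_entry;
    struct packet;

    extern void fsw_port_enqueue(uint32_t nx_port, struct packet *p);
    extern void bsd_input(struct packet *p);
    extern void pkt_drop(struct packet *p);
    extern void pkt_transform(struct flow_entry *fe, struct packet *p);

    /*
     * The action lives on the flow entry found during flow lookup, so
     * applying it needs no separate rule-database lookup.
     */
    void
    flow_apply_action(struct flow_entry *fe, enum flow_action act,
                      uint32_t nx_port, struct packet *p)
    {
        switch (act) {
        case FLOW_ACT_FORWARD_CHANNEL: fsw_port_enqueue(nx_port, p); break;
        case FLOW_ACT_FORWARD_BSD:     bsd_input(p);                 break;
        case FLOW_ACT_DROP:            pkt_drop(p);                  break;
        case FLOW_ACT_TRANSFORM:       pkt_transform(fe, p);         break;
        }
    }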

Flow Routes

ARP and routing are still managed by the BSD stack; user space network stack infrastructure flows need to consult the BSD stack for information like the default gateway MAC, etc., which incurs per-packet overhead.

A user space network stack infrastructure flow route is a cache around that BSD information, such that packets of user space network stack infrastructure flows can find that information within the user space network stack infrastructure context along with the flow lookup. The flow route is notified when related events happen, e.g. a route change or ARP expiry, to maintain consistency. Flow routes allow packets going out of the system via user space network stack infrastructure channels to avoid per-packet routing table lookup costs.

Flow Tracker

With the user space network stack infrastructure, the TCP/IP protocol is in user space, so the kernel loses knowledge of flows as compared to a kernel TCP/IP protocol stack. Such knowledge could be useful to help the kernel make decisions, e.g. scheduling, resource management, etc.

The flow-switch has a flow tracker that passively tracks flow state and stats during flow classification. It provides a KPI for other components to query flow states and stats. It also takes some proactive actions in cleaning up flows that are deemed to be terminated (by both ends) and not expecting any more data.

Achieving Low Latency for Urgent Packets Using Flow Tracking

Since we explicitly batch packets before delivering them to the NIC or user space, urgent packets like DNS queries and TCP control packets can be unfavorably delayed.

The flow tracker checks for those packets and does a flush/notify when it sees them, to ensure we deliver them with low latency. This allows it to dynamically adjust the notifications posted to the user space process depending on the contents of the packets.

Sharing of Packet Pool Among Trusted Ports

By default, apps are not mutually trusted, so we need to create a separate packet pool for each trust domain, e.g. the kernel's device packet pool and per-app packet pools. Cross-pool data movement incurs a data copy, and thus overhead, even between mutually trusted domains.

Packet pools can be configured to be shared across processes, e.g. between two processes, or between the kernel and trusted first-party apps. Packet movement then needs no copying, which allows for zero-copy data transfers between any of the entities, should the configuration allow for that.

IP Fragments Management. Lightweight Packet Reassembly for Channel (as if Perfect Network Condition)

The flow-switch doesn't own the TCP/IP stack; thus for incoming fragments which don't include full flow information (e.g. a first fragment with an incomplete TCP header, or later fragments without a TCP header), it can't successfully look up the flow and send them to the correct recipient.

The flow-switch does a lightweight fragment reassembly, where it first accumulates all fragments as they come in (e.g. keyed by IP addresses and IP ID, per the IP reassembly RFC), then does a single flow lookup, and then delivers all fragments to user space. From the user space protocol stack's point of view, the flow-switch provides an in-sequence delivery network abstraction, which makes it easier to handle the reception of fragments in the user space protocol stack.
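A simplified sketch of the accumulation step (the names are hypothetical, and a real implementation would also track byte ranges to detect holes): fragments are queued under the RFC 791 reassembly key until the first and last fragments have both arrived, at which point a single flow lookup is done and the whole queue is delivered in order.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct packet;

    /* reassembly key per RFC 791: addresses, protocol, and IP ID */
    struct frag_key {
        uint32_t saddr, daddr;
        uint16_t ip_id;
        uint8_t  proto;
    };

    struct frag_queue {
        struct frag_key key;
        struct packet  *frags[64];        /* accumulated in arrival order */
        size_t          nfrags;
        bool            have_first;       /* first fragment has L4 header */
        bool            have_last;        /* fragment with MF bit clear seen */
    };

    /*
     * Accumulate one fragment; once both the first and last fragments are
     * present, the caller performs the single flow lookup (using the L4
     * ports from the first fragment) and delivers the queue in sequence.
     */
    bool
    frag_queue_add(struct frag_queue *q, struct packet *p,
                   bool is_first, bool is_last)
    {
        if (q->nfrags < 64)
            q->frags[q->nfrags++] = p;
        q->have_first |= is_first;
        q->have_last |= is_last;
        return q->have_first && q->have_last;   /* ready for flow lookup */
    }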

NetNS for Port Tuple Arbitration

The kernel space BSD stack and the user space protocol stack instances need an efficient mechanism to share and arbitrate the 5-tuple network namespace, i.e. who gets to use which port on which source address, etc.

The user space network stack infrastructure architecture implements a shared namespace manager (NetNS) to enable sharing and arbitration of the network namespace between the various stack instances.

Offload Control Operations to BSD, e.g. L2 MAC Resolver (ARP/ND) and Callbacks, L3 Route Resolver and Callbacks and ICMP, TCP RST

The user space network stack infrastructure does not internally implement control protocol handlers such as ARP, ICMP, etc. Without those, the user space network stack infrastructure couldn't function as a network host.

The user space network stack infrastructure leverages existing functions in the BSD stack to handle those types of packets. The flow-switch, when it sees such packets, forwards them to the BSD stack. The user space network stack infrastructure then registers callbacks for events from the BSD stack, and queries it for information for its flow management, etc.

System-Wide Sysctl Via Shared Memory (RO)

In the user space network stack infrastructure, the IP/TCP stack stays in the user process and is instantiated per process, since user processes are segregated from one another. A user space network stack is generally initialized with heuristics from previous connections (e.g. RTO and TFO). Unlike the kernel network stack, which is a single shared instance, we need an efficient way to feed such initialization information to each process.

The user space network stack infrastructure implements a system-wide sysctl shared memory region. 1. It is a system-wide memory region shared by all processes to minimize memory usage. 2. It is controllable by the user via the sysctl command to allow easy tuning. 3. It is readable and controllable by the kernel network stack if needed. 4. For information that is only read by user space during initialization, a change is reflected in newly instantiated user space network stacks. 5. For information that is read during runtime, a change can be picked up by both existing and newly instantiated user space network stacks.

Leverage Shared Memory for User Space Stack. Per Stack Stats and Also Per Flow Stats.

The user space protocol stack needs to publish stats/knowledge to the kernel stack or other instances of the user space protocol stack efficiently. It is costly to pass that information across the user space/kernel boundary.

Along with the memory-mapped packets/buffers, the nexus also creates a shared memory region for stats objects; the kernel retrieves the stats efficiently using direct memory references, either periodically or based on events.

User Space Network Stack Statistics Preservation (Fold into Kernel Stats when User Space Stack Goes Away).

The user space protocol stack, unlike the BSD kernel stack, is instantiated per process and could come and go with the process life cycle; thus user space protocol stack stats could be lost (when the app exits).

When a user space protocol stack instance is destroyed, the user space protocol stack does a final publication of the stats, either via the aforementioned shared stats memory region or a system call; the kernel then preserves the stats in the kernel for accounting and diagnostic purposes.

Trusted RTT Estimation Based on Passive Observation.

The user space protocol stack needs a feedback mechanism to tell the kernel about its packet processing state, e.g. the processing time of each packet as compared to the kernel protocol stack.

The flow tracker passively and selectively timestamps TCP packets and computes the processing time of RX packets and the network latency of TX packets. This information is kept in the flow entry for bounds checking and as a scheduler hint, as well as for diagnostic purposes.

NLC v2 (NetEm)

We need to handle various networking conditions, e.g. bursty cellular downlink; but before that, we need a way to simulate them.

There is a NetEm packet scheduler on the RX/TX paths to simulate those networking conditions, to simulate hardware features, etc. This is done by leveraging the user space network stack infrastructure's built-in facilities, e.g. the pre- and post-sync and notify operations on the rings/queues.

Compressor & Decompressor

Need to handle header compression and decompression, e.g. 6LoWPAN.

Post-AQM (TX) and pre-input (RX) processing hooks for handling packet header compression and decompression, leveraging built-in hooks provided by the data path infrastructure.

Batching Optimizations in Bluetooth Daemon

Reduce the per-packet cost of Bluetooth communication.

Implement packet batching heuristics in the Bluetooth user space driver and efficiently move packet batches over user space network stack infrastructure channels, to/from agent processes, as well as to/from the kernel UART HW driver.

Replacing Socket-Based IPC with Channel: User Pipe (Bluetoothd & Identityservicesd), Kernel Pipe (Bluetoothd & AppleOnboardSerial)

Need an efficient IPC between user space processes. Need an efficient packet I/O interface for user space drivers.

The user space network stack infrastructure upipe nexus provides an efficient zero-copy packet I/O infrastructure for IPC. It also provides the ability to send and receive batches of packets, helping to amortize the cost of system calls.

The user space network stack infrastructure kernel pipe nexus provides a fast zero-copy I/O infrastructure for a user space protocol driver to communicate with a low-level in-kernel hardware driver. It likewise supports sending and receiving batches of packets, helping to amortize the cost of system calls.

Mitigation Thread Dynamic Threshold Table.

An interrupt mitigation scheme helping to reduce the interrupt processing load while preserving low latency and throughput.

Adaptive interrupt mitigation logic constantly adjusts based on packet statistics. The adjustment thresholds for the mitigation logic can be programmed per interface based on its throughput and link characteristics.
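One plausible shape for such a threshold table (the row values below are invented purely for illustration): each row maps a packet-rate band to a mitigation interval, and a table is chosen per interface according to its link characteristics.

    #include <stdint.h>
    #include <stddef.h>

    /*
     * One row of a hypothetical per-interface threshold table: while the
     * measured packet rate stays within [pkts_lo, pkts_hi), interrupts
     * are held off for delay_us.
     */
    struct mit_threshold {
        uint64_t pkts_lo;      /* packets/sec, inclusive */
        uint64_t pkts_hi;      /* packets/sec, exclusive */
        uint32_t delay_us;     /* mitigation interval */
    };

    static const struct mit_threshold example_mit_tbl[] = {
        {      0,       1000,   0 },  /* low rate: no mitigation, lowest latency */
        {   1000,      50000, 100 },
        {  50000,     250000, 200 },
        { 250000, UINT64_MAX, 400 },  /* heavy load: widest batching */
    };

    uint32_t
    mit_interval(const struct mit_threshold *tbl, size_t n,
                 uint64_t pkts_per_sec)
    {
        for (size_t i = 0; i < n; i++)
            if (pkts_per_sec >= tbl[i].pkts_lo && pkts_per_sec < tbl[i].pkts_hi)
                return tbl[i].delay_us;
        return 0;
    }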

Using RX Mitigation and RX Ring Size to Normalize Packet Flow in Bursty Cellular Conditions

Cellular radio conditions could lead to bursty receiving (very high throughput), leading to packet drops in the stack.

Normalizing the bursty packet load at the network interface, by adjusting the mitigation logic thresholds and the input queue size, yields a uniform throughput in bursty scenarios. This is critical for performance, as bursty packet delivery might otherwise result in significant packet loss within the system itself.

Local RTT w/ CLPC Closed Loop to Optimize User Stack Latency.

The user space network stack infrastructure IP/TCP stack operates in user space, where processes generally have lower scheduling priority than the kernel. The time to process incoming data and generate acknowledgements to the server should be kept small, so that the server can send the next data sooner and reduce the total time of a payload transfer. Since the user space network stack doesn't have as high a priority as the kernel network stack, there needs to be a way to make sure it gets enough CPU time.

To do this, we leverage the RTT estimation technique built into the flow-switch to track the user stack processing time and form a closed loop along with the scheduler and the CPU frequency adjuster. The closed loop controller takes as input the flow-switch local RTT (user space network stack processing time) estimate, the CPU frequency, and the process scheduling properties; its outputs are the next CPU frequency and process priority. The end result is that the user space network stack gets enough CPU frequency and time without using extra power, and still gets close to kernel network stack performance.

Submission/Completion Queue Driver Model

Device drivers require a common and flexible queueing model in the device driver abstraction layer for packet I/O. The queues hide the underlying user space network stack infrastructure rings, and also reduce the locking contention between the driver work loop and the user space network stack infrastructure threads.

Driver-facing queues in the IOSkywalkFamily expose a submission/completion model and are internally backed by user space network stack infrastructure rings. A submission queue dispatches packets to the device driver, with the goal of keeping the hardware ring/queue full without overflowing it. Packets are delivered to the driver in batches to reduce the per-packet cost. A completion queue handles packets returned by the driver, and also provides feedback to the respective submission queue to implement flow control.
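In rough outline (the interface below is a hypothetical sketch, not the IOSkywalkFamily API), the model reduces to a batch submit path toward the hardware, a completion path back, and a budget derived from completion feedback that keeps the submission side from overflowing the hardware ring:

    #include <stddef.h>

    struct packet;

    /* hypothetical driver-facing queue operations */
    struct drv_queue_ops {
        /* batch dispatch toward the hardware; returns packets accepted */
        size_t (*submit)(void *drv, struct packet **pkts, size_t count);

        /* driver invokes this as the hardware finishes packets; the
         * queue layer recycles them and updates its flow-control state */
        void   (*complete)(void *qctx, struct packet **pkts, size_t count);
    };

    /*
     * Flow-control budget fed back to the submission queue: how many
     * more packets may be pushed without overflowing the hardware ring.
     */
    size_t
    subq_budget(size_t hw_ring_size, size_t hw_ring_inflight)
    {
        return hw_ring_size - hw_ring_inflight;
    }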

Receive Submission/Completion Queues that Work with Buffers Instead of Packets

A mechanism is needed for device drivers to leverage the “multi-buflet descriptors” idea while still being based on the submission/completion queue model.

A receive buffer submission queue dispatches an array of new packet buffers (not packets) to the driver. The network hardware then fills those buffers with received data, with packets that can potentially span multiple buffers. The driver then notifies the buffer completion queue, which allocates an array of zero-buflet packets and presents that packet array to the driver. The driver can then attach one or more buffers to each packet, and also update the packet metadata.

Driver Doorbell and Refill

Free-standing transmit packets in a driver-level ring or queue defeat the purpose of the AQM queue. A mechanism is needed to prevent this by using a transmit doorbell and AQM refill.

A doorbell notifies the driver layer when one or more packets are available; IOSkywalkFamily then queries the driver for the amount of free space available, in either packets or bytes. A refill operation is then requested with this free space information, which dequeues a bounded amount of packets from the AQM queue and passes them along to the driver's ring/queue for immediate consumption.
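A sketch of that doorbell/refill flow (the function names are stand-ins): packets remain in the AQM queue, where the scheduler can still reorder or drop them, until the driver reports free space, and only then is a bounded batch moved down.

    #include <stddef.h>

    struct packet;

    /* stand-ins for the AQM queue and the driver ring */
    extern size_t aqm_dequeue(struct packet **pkts, size_t max);
    extern size_t drv_ring_free_slots(void *drv);
    extern void   drv_ring_enqueue(void *drv, struct packet **pkts, size_t n);

    /*
     * Doorbell handler: query the driver's free space once, then refill
     * in bounded batches so the driver ring never holds free-standing
     * packets beyond what it can immediately consume.
     */
    void
    tx_doorbell(void *drv)
    {
        struct packet *batch[64];
        size_t space = drv_ring_free_slots(drv);   /* queried on doorbell */

        while (space > 0) {
            size_t want = space < 64 ? space : 64;
            size_t n = aqm_dequeue(batch, want);   /* bounded refill */
            if (n == 0)
                break;
            drv_ring_enqueue(drv, batch, n);
            space -= n;
        }
    }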

Queue Level Reporting for Network Scheduling

LTE requires the modem to send the base station a “buffer status report” (BSR) indicating the amount of data the modem intends to transmit to the base station. The base station may then send a “grant” to the modem, which entitles the modem to transmit a certain amount of data. This mechanism enables the base station to manage network resources according to policies specified by the network operator. One such policy may be to prioritize each device according to the amount of data it reported in its latest BSR. Another policy may be to deprioritize a device which reports via BSR more data than it actually transmits in response to a grant.

It is possible to consider only the amount of data pending in the modem TX queues when sending a BSR. For such an implementation, limiting modem TX queue lengths to reduce bufferbloat also limits the amount of data the modem is able to report in its BSR, which may reduce the sizes of the grants allocated to the modem by the base station. Existing implementations enable the host to communicate its interface queue lengths to the modem for inclusion in BSR reports. However, with IOSkywalkFamily there is not necessarily any such interface queue.

We added the capability of reporting the size of the host AQM queues to the modem. This report includes only data which is guaranteed to be sent to the modem.

Provide Possible Data Transmission Opportunity Enabling Efficient Resource Allocation

To save power, the radio hardware enters a low-power state after a certain interval of inactivity. As a further optimization, existing implementations enter this state more quickly (“fast dormancy”) if no data transmission is imminent. To determine whether data transmission is imminent, existing implementations enable the host to communicate its interface queue lengths and socket buffer lengths to the modem. However, with IOSkywalkFamily there is not necessarily any such interface queue or socket buffer.

We added the capability of reporting a hint for whether additional data might be sent to the modem. This report includes data which is guaranteed to be sent to the modem, and also includes data which may or may not be sent, e.g. data in TCP retransmission queues or data which may be cancelled by the app.

Transparent Security (IPsec) Gateway

The current in-kernel implementation of IPsec requires a system reboot for updates to take effect. This is not desirable for certain use cases. It also suffers from performance issues associated with the BSD networking stack design.

The user space network stack infrastructure allows most IPsec components to be in user space. Installing new components only requires restarting the user space IPsec forwarding daemon. In addition, the user space transformation plane allows for significantly better performance due to the elimination of costs associated with the in-kernel design and implementation.

Bridging, Forwarding and Routing

A general purpose networking stack is not ideal for bridging/forwarding/routing because the data path involves traversing many network stack layers.

The user space network stack infrastructure's architecture allows for implementing a customized user space data path optimized for forwarding. This design allows the user space data path to better leverage hardware capabilities such as flow classification and encryption offload for improved performance. Note that this refers to a user space forwarding plane, rather than an in-kernel one.

Tapping on any Channel (libpcap/tcpdump)

We need an efficient way to know what the user space stacks are sending and receiving on their channels.

We created a special type of nexus called the monitor nexus, which taps on the same channel that it is monitoring, and hooks up libpcap/tcpdump so that existing APIs can be used to get those packets. This provides a uniform way to inspect traffic going across any channel in the system.

Test User Space TCP Stack

We needed to run network transport testing tools on user space stacks. The existing tools utilizing BSD sockets APIs no longer work.

We used the monitor nexus and the libpcap changes described above, along with improved support for libnetcore in packetdrill, to test the user space TCP stack. This allows us to validate that the user space TCP stack is on par with kernel TCP in terms of functionality and correctness.

Nexus Statistics (Flow-Switch Stats)/Channel Statistics (Ring Stats/Sync Stats)/Flow Stats

In a packet I/O infrastructure where data moves across multiple layers, it is important to have visibility into statistics at each layer. These statistics can be used for varying needs: diagnosing issues, data accounting, and memory allocation and purging heuristics.

The nexus counts stats in several tiers: 1. nexus stats, which include all packets going through it, covering all channels and the BSD stack; 2. channel stats, which account only for packets going through a channel, and include ring stats and sync stats, providing insight like the batch size per sync; 3. flow stats, which provide packets/bytes counters as well as flow states.

Scheduling Hint Added to TCP RTT for Each Process Running User Space TCP

The TCP protocol on our system uses the round trip time (RTT) learnt from previous connections to the same destination. The RTT information for each TCP connection is stored on the system and is used by each new TCP flow to bootstrap its learning about the network path to the destination. This helps new TCP flows recover quickly from packet losses seen on the network.

With the user space network stack infrastructure, TCP runs in each user process's context. Each process running user space TCP has a different scheduling priority. This scheduling priority adds to the delay in responding to TCP packets, which has an impact on the observed TCP RTT. So, a lower priority process with higher scheduling delays could see a higher TCP RTT to the same destination compared to a higher priority process.

If the stack used the RTT learnt by the system as a whole, a new flow in a higher priority process could use the hints from a lower priority process. This could have a bad impact on the TCP connection. So, one process's observed RTT needs to be prevented from impacting the RTT of another process during TCP initialization.

We maintain per-process TCP RTT heuristics, which track the TCP RTT for flows only within that process. So, a new flow opened in a process leverages the learnt TCP RTT only from the same process. This minimizes the impact of different process scheduling priorities.
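Illustratively (the names and the IPv4-only key are assumptions), the heuristics cache is simply keyed by process as well as destination, so a new flow bootstraps only from its own process's history:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Hypothetical per-process RTT cache: entries are keyed by
     * (pid, destination) rather than destination alone, so one process's
     * scheduling-inflated RTT never seeds another process's flows.
     */
    struct rtt_key {
        int32_t  pid;
        uint32_t dst_addr;                /* IPv4 for brevity */
    };

    struct rtt_hist {
        uint64_t srtt_us;                 /* smoothed RTT for this process */
    };

    extern struct rtt_hist *rtt_cache_lookup(const struct rtt_key *key,
                                             bool create);

    /* a new flow bootstraps only from its own process's history */
    uint64_t
    rtt_initial_for_flow(int32_t pid, uint32_t dst_addr,
                         uint64_t default_rtt_us)
    {
        struct rtt_key k = { .pid = pid, .dst_addr = dst_addr };
        struct rtt_hist *h = rtt_cache_lookup(&k, false);
        return (h != NULL) ? h->srtt_us : default_rtt_us;
    }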

It will be recognized that while certain embodiments of the present disclosure are described in terms of a specific sequence of steps of a method, these descriptions are only illustrative of the broader methods described herein, and may be modified as required by the particular application. Certain steps may be rendered unnecessary or optional under certain circumstances. Additionally, certain steps or functionality may be added to the disclosed embodiments, or the order of performance of two or more steps permuted. All such variations are considered to be encompassed within the disclosure and claimed herein.

While the above detailed description has shown, described, and pointed out novel features as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from principles described herein. The foregoing description is of the best mode presently contemplated. This description is in no way meant to be limiting, but rather should be taken as illustrative of the general principles described herein. The scope of the disclosure should be determined with reference to the claims.

What is claimed is:
1. A method for copy and checksum optimizations with user space communication stacks, the method comprising: configuring a first user space application with pass-through checksum functionality; reading data from a first pool of resources associated with the first user space application; calculating a checksum value based on the data; and storing the data in a second pool of resources associated with a hardware driver.
2. The method of claim 1, wherein reading the data comprises reading a plurality of word segments.
3. The method of claim 2, wherein calculating a checksum value based on the data comprises a running summation of the plurality of word segments.
4. The method of claim 3, wherein storing the data in the second pool of resources comprises storing the checksum value.
5. The method of claim 4, wherein reading the data from the first pool of resources is performed by a kernel space process.
6. The method of claim 4, wherein calculating the checksum value is performed by a kernel space process.
7. The method of claim 1, wherein the hardware driver is configured for a network interface card.
8. The method of claim 7, wherein the hardware driver does not provide checksum functionality.
9. A system configured for copy and checksum optimizations with user space communication stacks, the system comprising: an application that comprises a user space communication stack; a first pool of dedicated memory resources for the application; a second pool of dedicated memory resources for a kernel space hardware driver; a kernel space flow-switch configured to copy-checksum data from the first pool of dedicated memory resources to the second pool of dedicated memory resources; and kernel space logic configured to: read data from the first pool of dedicated memory resources; calculate a checksum value based on the data; and store the data in the second pool of dedicated memory resources.
10. The system of claim 9, wherein the kernel space hardware driver comprises a network interface card.
11. The system of claim 10, wherein the network interface card is configured to transmit IP data.
12. The system of claim 10, wherein the network interface card does not include a checksum functionality.
13. The system of claim 10, wherein the network interface card operates without a checksum functionality.
14. The system of claim 13, wherein the user space communication stack operates without the checksum functionality.
15. The system of claim 14, wherein the user space communication stack operates in a pass-through mode.
16. The system of claim 9, wherein the kernel space logic is configured to read data from the first pool of dedicated memory resources in word segments.
17. The system of claim 16, wherein the kernel space logic is configured to calculate the checksum from the word segments.
18. The system of claim 9, wherein the kernel space logic is prioritized over user space logic.
19. A non-transitory computer readable apparatus comprising a storage medium having one or more computer programs stored thereon, the one or more computer programs, when executed by a processing apparatus, being configured to: read one word of data from a first pool of memory; calculate a checksum value based on the one word of data; and store the one word of data in a second pool of memory.
20. The non-transitory computer readable apparatus of claim 19, wherein the first pool of memory is dedicated to a first application comprising a hardware driver, the hardware driver receiving data for a user space networking stack.